
Conversation

ram4444
Contributor

@ram4444 ram4444 commented Jul 26, 2025

What this PR does / why we need it:

Adds out-of-the-box support for TrainJob and provides an example.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Member

@andreyvelich andreyvelich left a comment

Can you move this example to examples/v1beta1/kubeflow-trainer/trainjob-pytorch.yaml?

# min: 1
# max: 5
trialTemplate:
primaryContainerName: pytorch
Member

This should be node, right?

Contributor Author

I don't quite understand your question.
This is copied from pytorchjob-mnist.yaml, and I kept it the same.

Contributor Author

I have done it. Please state somewhere in the docs that this field refers to the ClusterTrainingRuntime; I had assumed it was a custom container name defined by the user.

Comment on lines 42 to 44
- name: arg
  description: An additional argument for the training model
  reference: arg
Member

This can be removed

Contributor Author

OK, I will remove it.

Should I create another PR or put everything on top of these 2 commits?

BTW Do you have any docker image for pytorch-deepspeed_train_t5 so that I can create another example?

Member

Should I create another PR or put everything on top of these 2 commits?

You can make the appropriate changes in this PR.

BTW Do you have any docker image for pytorch-deepspeed_train_t5 so that I can create another example?

We don't have a Docker image for it, since we create it using the Kubeflow SDK: https://github.com/kubeflow/trainer/blob/master/examples/deepspeed/text-summarization/T5-Fine-Tuning.ipynb

@andreyvelich
Member

cc @kramaranya @szaher

@juliusvonkohout
Member

/ok-to-test

Once that is in, I can merge kubeflow/manifests#3199.

@andreyvelich
Member

@ram4444 Please can you update the controller code according to this: #2560 (review) ?

@ram4444
Contributor Author

ram4444 commented Aug 5, 2025

@ram4444 Please can you update the controller code according to this: #2560 (review) ?

Hi,

Is it simply a matter of adding an entry to pkg/apis/controller/experiments/v1beta1/constants.go, like this?

KubeflowJobKinds = map[string]bool{
	"TFJob":      true,
	"PyTorchJob": true,
	"XGBoostJob": true,
	"MPIJob":     true,
	"TrainJob":   true,
}

@andreyvelich
Member

No, you have to update other places as I mentioned in this comment: #2560 (review)

@ram4444
Contributor Author

ram4444 commented Aug 6, 2025

I still don't have a clear idea of it. Could you explain more?

@andreyvelich
Member

Could you read this doc, which explains how CRDs work within a Katib Trial: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template ?
We should update the default values for the success and failure conditions and
for PrimaryPodLabels, which identify the master training pod.
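
As a rough illustration of what the primary pod labels are for, here is a minimal Go sketch, not Katib's actual code, that checks whether a pod's labels contain every key/value pair in a given label set. The function name matchesPrimaryPodLabels and the example.com/role label are hypothetical, chosen only for this example.

```go
package main

import "fmt"

// matchesPrimaryPodLabels reports whether podLabels contains every key/value
// pair listed in primary. An empty selector matches nothing here, so that a
// missing default is noticed rather than silently matching every pod. That
// choice is an assumption of this sketch, not documented Katib behavior.
func matchesPrimaryPodLabels(podLabels, primary map[string]string) bool {
	if len(primary) == 0 {
		return false
	}
	for key, want := range primary {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical label keys and values, for illustration only.
	primary := map[string]string{"example.com/role": "master"}
	master := map[string]string{"example.com/role": "master", "app": "train"}
	worker := map[string]string{"example.com/role": "worker", "app": "train"}

	fmt.Println("master pod is primary:", matchesPrimaryPodLabels(master, primary))
	fmt.Println("worker pod is primary:", matchesPrimaryPodLabels(worker, primary))
}
```

The point is that only one pod per trial should match the label set, so the defaults have to name a label that the job controller actually puts on that pod.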

@ram4444
Contributor Author

ram4444 commented Aug 6, 2025

I have gone through the code and the doc, but I don't understand what should be added or changed at the specified lines (110 and 117).

Should I add an else branch for TrainJob? I am not sure what the default FailureCondition, SuccessCondition, and PrimaryPodLabels should be.

func (e *Experiment) setDefaultTrialTemplate() {
	t := e.Spec.TrialTemplate

	// Set default values for Job and Kubeflow Training Job if TrialSpec is not nil
	if t != nil && t.TrialSource.TrialSpec != nil {
		jobKind := t.TrialSource.TrialSpec.GetKind()
		if jobKind == consts.JobKindJob {
			if t.SuccessCondition == "" {
				t.SuccessCondition = DefaultJobSuccessCondition
			}
			if t.FailureCondition == "" {
				t.FailureCondition = DefaultJobFailureCondition
			}
		} else if _, ok := KubeflowJobKinds[jobKind]; ok {
			if t.SuccessCondition == "" {
				t.SuccessCondition = DefaultKubeflowJobSuccessCondition
			}
			if t.FailureCondition == "" {
				t.FailureCondition = DefaultKubeflowJobFailureCondition
			}
			// For Kubeflow Job also set default PrimaryPodLabels
			if len(t.PrimaryPodLabels) == 0 {
				t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
			}
		} else if jobKind == "TrainJob" {
			if t.SuccessCondition == "" {
				// t.SuccessCondition = DefaultKubeflowJobSuccessCondition
				// A different default value for the success condition
			}
			if t.FailureCondition == "" {
				// t.FailureCondition = DefaultKubeflowJobFailureCondition
				// A different default value for the failure condition
			}
			if t.PrimaryPodLabels == nil {
				// t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
				// A different default value for PrimaryPodLabels
			}
		}
	}
	e.Spec.TrialTemplate = t
}

@andreyvelich
Member

DefaultKubeflowJobSuccessCondition

Here are the values that we should use for TrainJob:

	DefaultTrainJobSuccessCondition = "status.conditions.#(type==\"Complete\")#|#(status==\"True\")#"
	DefaultTrainJobFailureCondition = "status.conditions.#(type==\"Failed\")#|#(status==\"True\")#"
	DefaultTrainJobPrimaryPodLabels = map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node"}
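
To make the two condition expressions concrete, the sketch below approximates them with plain encoding/json instead of the query syntax used above: it reports whether status.conditions contains an entry of the given type whose status is "True". The helper hasTrueCondition and the sample JSON are assumptions made for illustration; this is not how Katib itself evaluates the expressions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the two fields the expressions above filter on.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// object holds only the part of a resource that the check needs.
type object struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// hasTrueCondition reports whether status.conditions contains an entry with
// the given type and status "True". It is a hand-rolled stand-in for the
// query expressions, written for this example only.
func hasTrueCondition(raw []byte, condType string) (bool, error) {
	var obj object
	if err := json.Unmarshal(raw, &obj); err != nil {
		return false, err
	}
	for _, c := range obj.Status.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hand-written sample resembling a finished TrainJob status.
	raw := []byte(`{"status":{"conditions":[{"type":"Complete","status":"True"}]}}`)

	succeeded, err := hasTrueCondition(raw, "Complete")
	if err != nil {
		panic(err)
	}
	failed, err := hasTrueCondition(raw, "Failed")
	if err != nil {
		panic(err)
	}
	fmt.Println("succeeded:", succeeded, "failed:", failed)
}
```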

@ram4444
Contributor Author

ram4444 commented Aug 7, 2025

Please consider whether we should put them in constants.go for consistency (or I can proceed as mentioned).

@andreyvelich
Member

Yes, please add them into constants.go

Member

@andreyvelich andreyvelich left a comment

@ram4444 Would you be able to verify that this integration works on your local Kind cluster?
Since we don't have E2Es for TrainJob, it would be nice to verify it.
cc @Electronic-Waste @kubeflow/kubeflow-trainer-team @astefanutti

t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
}
}
} else if jobKind == "TrainJob" {
Member

Could you add TrainJob to the KubeflowJobKinds list as well please ?

Contributor Author

The function would then look something like this:

func (e *Experiment) setDefaultTrialTemplate() {
	t := e.Spec.TrialTemplate

	// Set default values for Job and Kubeflow Training Job if TrialSpec is not nil
	if t != nil && t.TrialSource.TrialSpec != nil {
		jobKind := t.TrialSource.TrialSpec.GetKind()
		if jobKind == consts.JobKindJob {
			if t.SuccessCondition == "" {
				t.SuccessCondition = DefaultJobSuccessCondition
			}
			if t.FailureCondition == "" {
				t.FailureCondition = DefaultJobFailureCondition
			}
		} else if _, ok := KubeflowJobKinds[jobKind]; ok {
			// TrainJob is listed in KubeflowJobKinds but gets its own defaults
			if t.SuccessCondition == "" {
				if jobKind == "TrainJob" {
					t.SuccessCondition = DefaultTrainJobSuccessCondition
				} else {
					t.SuccessCondition = DefaultKubeflowJobSuccessCondition
				}
			}
			if t.FailureCondition == "" {
				if jobKind == "TrainJob" {
					t.FailureCondition = DefaultTrainJobFailureCondition
				} else {
					t.FailureCondition = DefaultKubeflowJobFailureCondition
				}
			}
			// For Kubeflow Job also set default PrimaryPodLabels
			if len(t.PrimaryPodLabels) == 0 {
				if jobKind == "TrainJob" {
					t.PrimaryPodLabels = DefaultTrainJobPrimaryPodLabels
				} else {
					t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
				}
			}
		}
	}
	e.Spec.TrialTemplate = t
}
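
For anyone who wants to try this defaulting logic outside the controller, here is a self-contained approximation using stand-in types. The TrainJob values are the ones given earlier in this thread; the trialTemplate struct, the setDefaults function, and the placeholder strings for the other job kinds are hypothetical and only mimic the shape of the real code.

```go
package main

import "fmt"

// trialTemplate is a stand-in for the relevant fields of Katib's TrialTemplate.
type trialTemplate struct {
	SuccessCondition string
	FailureCondition string
	PrimaryPodLabels map[string]string
}

// kindDefaults groups the default values applied for one job kind.
type kindDefaults struct {
	success string
	failure string
	labels  map[string]string
}

var (
	// Values quoted from this thread for TrainJob.
	trainJobDefaults = kindDefaults{
		success: `status.conditions.#(type=="Complete")#|#(status=="True")#`,
		failure: `status.conditions.#(type=="Failed")#|#(status=="True")#`,
		labels:  map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node"},
	}
	// Placeholders only; the real values live in Katib's constants.go.
	kubeflowJobDefaults = kindDefaults{
		success: "<kubeflow-job-success>",
		failure: "<kubeflow-job-failure>",
		labels:  map[string]string{"<label-key>": "<label-value>"},
	}
	jobDefaults = kindDefaults{success: "<job-success>", failure: "<job-failure>"}

	kubeflowJobKinds = map[string]bool{
		"TFJob": true, "PyTorchJob": true, "XGBoostJob": true, "MPIJob": true, "TrainJob": true,
	}
)

// setDefaults fills only the fields the user left empty.
func setDefaults(kind string, t *trialTemplate) {
	if t == nil {
		return
	}
	var d kindDefaults
	switch {
	case kind == "Job":
		d = jobDefaults
	case kind == "TrainJob": // checked before the generic Kubeflow kinds
		d = trainJobDefaults
	case kubeflowJobKinds[kind]:
		d = kubeflowJobDefaults
	default:
		return
	}
	if t.SuccessCondition == "" {
		t.SuccessCondition = d.success
	}
	if t.FailureCondition == "" {
		t.FailureCondition = d.failure
	}
	if len(t.PrimaryPodLabels) == 0 && len(d.labels) > 0 {
		// Copy so callers cannot mutate the shared defaults.
		t.PrimaryPodLabels = make(map[string]string, len(d.labels))
		for k, v := range d.labels {
			t.PrimaryPodLabels[k] = v
		}
	}
}

func main() {
	tmpl := &trialTemplate{SuccessCondition: "custom"}
	setDefaults("TrainJob", tmpl)
	fmt.Println(tmpl.SuccessCondition) // user-provided value is kept
	fmt.Println(tmpl.FailureCondition)
	fmt.Println(tmpl.PrimaryPodLabels)
}
```

Selecting the per-kind defaults once keeps the three "fill if empty" checks from being repeated for each kind, which is one possible way to avoid the nested if/else above.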

To test it on my local Kubernetes cluster: since the Katib controller is installed by pulling an image via kustomize, please let me know the command to build the image from scratch.

images:
  - name: ghcr.io/kubeflow/katib/katib-controller
    newName: ghcr.io/kubeflow/katib/katib-controller
    newTag: v0.18.0

Member

You can build the controller image as follows:

docker build . -f cmd/katib-controller/v1beta1/Dockerfile -t <MY_IMAGE>

@ram4444
Contributor Author

ram4444 commented Aug 12, 2025

Hi,

I am sorry to say that, due to a hardware issue (my 10-year-old homelab, the only server I have that can run the whole of Kubeflow, is down), I am not able to test it at the moment. 😩

I could commit my latest code to my own repo first. Please let me know if it is ok to proceed.

Ram

@andreyvelich
Member

Sure, no problem, I can try to deploy it from my machine. Please push your latest changes.

Btw, you don't need to deploy the entire Kubeflow Platform, you can just deploy Katib + Trainer control plane to verify it.

@juliusvonkohout
Member

You can run it with 4 GB or so; just remove what you do not need. See https://github.com/kubeflow/manifests#kubeflow-components-versions

@ram4444
Contributor Author

ram4444 commented Aug 13, 2025

@andreyvelich @juliusvonkohout

I have committed the latest changes to my repo, and I have built the Katib controller image and pushed it to my Docker Hub repo.

The command for my deployment:
kustomize build applications/katib/upstream/installs/katib-with-kubeflow | kubectl apply -f -

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  - ../katib-cert-manager
  # Kubeflow Katib components.
  - kubeflow-katib-roles.yaml
  - ui-virtual-service.yaml
  - istio-authorizationpolicy.yaml
images:
  #- name: ghcr.io/kubeflow/katib/katib-controller
  #  newName: ghcr.io/kubeflow/katib/katib-controller
  #  newTag: v0.18.0
  - name: dionysbiz/katib-controller
    newName: dionysbiz/katib-controller
    newTag: latest
  - name: ghcr.io/kubeflow/katib/katib-db-manager
    newName: ghcr.io/kubeflow/katib/katib-db-manager
    newTag: v0.18.0
  - name: ghcr.io/kubeflow/katib/katib-ui
    newName: ghcr.io/kubeflow/katib/katib-ui
    newTag: v0.18.0

The training operator produces the following error log:

{"level":"info","ts":"2025-08-13T20:05:37.062503964Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:198","msg":"Probe endpoints are configured on healthz and readyz"}
{"level":"error","ts":"2025-08-13T20:05:37.064450881Z","logger":"setup","caller":"training-operator.v2alpha1/main.go:145","msg":"Could not initialize runtimes","error":"initializing runtime "TrainingRuntime.trainer.kubeflow.org": setting index on TrainingRuntime for TrainJob: no matches for kind "TrainJob" in version "trainer.kubeflow.org/v2alpha1"","stacktrace":"main.main\n\t/workspace/cmd/training-operator.v2alpha1/main.go:145\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"}
stream closed EOF for kubeflow-system/training-operator-v2-7b9949cc86-cq2zm (manager)

@ram4444
Contributor Author

ram4444 commented Aug 21, 2025

Should I replace the image with my own latest build or just use the original ghcr one?

@andreyvelich
Member

Yes, please use your image: dionysbiz/katib-controller:latest

@ram4444
Contributor Author

ram4444 commented Aug 21, 2025

TrainJob

Name:         torch-distributed-example-526stpjk
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  trainer.kubeflow.org/v1alpha1
Kind:         TrainJob
Metadata:
  Creation Timestamp:  2025-08-21T02:48:55Z
  Generation:          1
  Owner References:
    API Version:           kubeflow.org/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Trial
    Name:                  torch-distributed-example-526stpjk
    UID:                   e2370d25-c5b1-4352-8f66-256b0350ff64
  Resource Version:        6966
  UID:                     8e5e65d5-89ad-4aa1-96f8-83140ac504ad
Spec:
  Managed By:  trainer.kubeflow.org/trainjob-controller
  Runtime Ref:
    API Group:  trainer.kubeflow.org
    Kind:       ClusterTrainingRuntime
    Name:       torch-distributed
  Suspend:      false
  Trainer:
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
      --lr=0.031100969364658455
      --momentum=0.5426043366290967
    Image:      ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest
    Num Nodes:  2
Status:
  Conditions:
    Last Transition Time:  2025-08-21T03:08:46Z
    Message:               jobset completed successfully
    Reason:                AllJobsCompleted
    Status:                True
    Type:                  Complete
Events:                    <none>

I still cannot find the metrics collector and it is showing the same error log in the katib controller log after the job has finished
In addition, I cannot list out the experiments CRDs.
Please also note that the TrainJob is using ClusterTrainingRuntime, whereas it was TrainingRuntime.

@andreyvelich
Member

I still cannot find the metrics collector and it is showing the same error log in the katib controller log after the job has finished

Can you show the full log from the Katib controller?

in addition, I cannot list out the experiments crds

What do you mean? I can see that the TrainJob you showed uses Katib's Trial, so the Experiment should be there.

@ram4444
Contributor Author

ram4444 commented Aug 21, 2025

I've just re-run the experiment, and here is the log from when it starts:

{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"Creating Suggestion","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"Suggestion created","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example"}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"Creating Suggestion","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"suggestion-controller","msg":"Creating Service","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example-random"}
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-suggestion-client","msg":"CreateSuggestion failed","experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"instance":"torch-distributed-example","error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/suggestion.(*General).GetOrCreateSuggestion\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/suggestion/suggestion.go:61\ngithub.com/ku
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion failed","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3,"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileSuggestions\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Get suggestions error","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).createTrials\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:350\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*Reconc
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Create trials error","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileTrials\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:334\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*Recon
{"level":"error","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile experiment error","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controlle
{"level":"error","ts":"2025-08-21T21:37:50Z","msg":"Reconciler error","controller":"experiment-controller","object":{"name":"torch-distributed-example","namespace":"kubeflow"},"namespace":"kubeflow","name":"torch-distributed-example","reconcileID":"9df30035-7a77-48ee-954c-f194366e27c8","error":"suggestions.kubeflow.org \"torch-distributed-example\" already exists","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/c
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"suggestion-controller","msg":"Creating Deployment","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example-random"}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
{"level":"info","ts":"2025-08-21T21:37:50Z","logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"err":"Operation cannot be fulfilled on suggestions.kubeflow.org \"torch-distributed-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-client","msg":"Algorithm settings are validated","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-controller","msg":"Sync assignments","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"Suggestion Requests":3,"Suggestion Count":0}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-client","msg":"Getting suggestions","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"endpoint":"torch-distributed-example-random.kubeflow:6789","Number of current request parameters":3,"Number of response parameters":3}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"name":"torch-distributed-example","Suggestion Requests":3}
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"suggestion-controller","msg":"Sync assignments","Suggestion":{"name":"torch-distributed-example","namespace":"kubeflow"},"Suggestion Requests":3,"Suggestion Count":3}                                                                                                                                                                                                                                                                                                                                 │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"experiment-controller","msg":"Created Trials","Experiment":{"name":"torch-distributed-example","namespace":"kubeflow"},"trialNames":["torch-distributed-example-x4x9j58j","torch-distributed-example-wqktfdwk","torch-distributed-example-jtgz2bhd"]}                                                                                                                                                                                                                                                  │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"},"kind":"TrainJob","name":"torch-distributed-example-x4x9j58j"}                                                                                                                                                                                                                                                                                                                     │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Trial status changed to Running","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}                                                                                                                                                                                                                                                                                                                                                                │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"},"kind":"TrainJob","name":"torch-distributed-example-wqktfdwk"}                                                                                                                                                                                                                                                                                                                     │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Trial status changed to Running","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}                                                                                                                                                                                                                                                                                                                                                                │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Creating Job","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"},"kind":"TrainJob","name":"torch-distributed-example-jtgz2bhd"}                                                                                                                                                                                                                                                                                                                     │
│ {"level":"info","ts":"2025-08-21T21:38:23Z","logger":"trial-controller","msg":"Trial status changed to Running","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}    

Toward the end of the run, the trial-controller repeatedly logs the following:

{"level":"info","ts":"2025-08-21T22:01:16Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:17Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:17Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:17Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:18Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:18Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:18Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:19Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:19Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:19Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:20Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:20Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:20Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:21Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:21Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:21Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:22Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:22Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:22Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:23Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:23Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:23Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:24Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:24Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:24Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:25Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:25Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:25Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:26Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:26Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:26Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:27Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:27Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:27Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:28Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:28Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:28Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:29Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:29Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:29Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:30Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:30Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:30Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:31Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:31Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:31Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:32Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:32Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:32Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:33Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:33Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:33Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:34Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:34Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:34Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:35Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:35Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:35Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:36Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:36Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:36Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:37Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:37Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:37Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:38Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:38Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:38Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:39Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:39Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:39Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:40Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:40Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:40Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:41Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:41Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:41Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:42Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:42Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:42Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:43Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:43Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:43Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:44Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:44Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:44Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:45Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:45Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:45Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:46Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:46Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:46Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:47Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:47Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:47Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:48Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:48Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:48Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:49Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:49Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:49Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:50Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:50Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:50Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:51Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:51Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:51Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:52Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:52Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:53Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:54Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:54Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:54Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:55Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:55Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:55Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:56Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:56Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:56Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:57Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:57Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:57Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:58Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:58Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:58Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:59Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-wqktfdwk","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:59Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-x4x9j58j","namespace":"kubeflow"}}
{"level":"info","ts":"2025-08-21T22:01:59Z","logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":{"name":"torch-distributed-example-jtgz2bhd","namespace":"kubeflow"}}

@andreyvelich
Member

We should also add batch.kubernetes.io/job-completion-index: 0 to the primaryPodLabels const.

@ram4444
Contributor Author

ram4444 commented Aug 22, 2025

So constant.go should be updated to:

DefaultTrainJobPrimaryPodLabels = map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node", "batch.kubernetes.io/job-completion-index": "0"}

I just rebuilt the image and added the label, but still got the same result:

Name:             torch-distributed-example-c599nw8t-node-0-0-hc855
Namespace:        kubeflow
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.19.0.2
Start Time:       Fri, 22 Aug 2025 02:10:31 +0100
Labels:           batch.kubernetes.io/controller-uid=e71dfdf7-2119-4a57-ac37-690db8bb23fb
                  batch.kubernetes.io/job-completion-index=0
                  batch.kubernetes.io/job-name=torch-distributed-example-c599nw8t-node-0
                  controller-uid=e71dfdf7-2119-4a57-ac37-690db8bb23fb
                  job-name=torch-distributed-example-c599nw8t-node-0
                  jobset.sigs.k8s.io/global-replicas=1
                  jobset.sigs.k8s.io/job-global-index=0
                  jobset.sigs.k8s.io/job-index=0
                  jobset.sigs.k8s.io/job-key=1e24e5259edfdec1fcb7914899d643c3a5abe72a
                  jobset.sigs.k8s.io/jobset-name=torch-distributed-example-c599nw8t
                  jobset.sigs.k8s.io/replicatedjob-name=node
                  jobset.sigs.k8s.io/replicatedjob-replicas=1
                  jobset.sigs.k8s.io/restart-attempt=0
Annotations:      batch.kubernetes.io/job-completion-index: 0
                  jobset.sigs.k8s.io/global-replicas: 1
                  jobset.sigs.k8s.io/job-global-index: 0
                  jobset.sigs.k8s.io/job-index: 0
                  jobset.sigs.k8s.io/job-key: 1e24e5259edfdec1fcb7914899d643c3a5abe72a
                  jobset.sigs.k8s.io/jobset-name: torch-distributed-example-c599nw8t
                  jobset.sigs.k8s.io/replicatedjob-name: node
                  jobset.sigs.k8s.io/replicatedjob-replicas: 1
                  jobset.sigs.k8s.io/restart-attempt: 0
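
For reference, here is a minimal Go sketch of the kind of label check being discussed, assuming a pod counts as primary when every entry of the primary pod labels map is present on the pod with the same value. The helper name matchesPrimaryLabels is hypothetical and is not taken from the Katib code base; the pod labels are copied from the describe output above.

```go
package main

import "fmt"

// primaryPodLabels mirrors the value proposed in this thread for
// DefaultTrainJobPrimaryPodLabels.
var primaryPodLabels = map[string]string{
	"jobset.sigs.k8s.io/replicatedjob-name":    "node",
	"batch.kubernetes.io/job-completion-index": "0",
}

// matchesPrimaryLabels is a hypothetical helper (an assumption for this
// sketch). It reports whether every required label is present on the pod
// with exactly the same value.
func matchesPrimaryLabels(podLabels, required map[string]string) bool {
	for key, want := range required {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// A subset of the labels from the pod described above.
	podLabels := map[string]string{
		"batch.kubernetes.io/job-completion-index": "0",
		"batch.kubernetes.io/job-name":             "torch-distributed-example-c599nw8t-node-0",
		"jobset.sigs.k8s.io/jobset-name":           "torch-distributed-example-c599nw8t",
		"jobset.sigs.k8s.io/replicatedjob-name":    "node",
	}
	fmt.Println(matchesPrimaryLabels(podLabels, primaryPodLabels))
}
```

Under that assumption the pod above already carries both labels, which suggests the label set on its own may not be what is blocking metrics collection.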

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Aug 22, 2025
@andreyvelich
Member

Found an issue.
As described in this doc, you must give the Katib controller permission to access all nested resources that the Trial creates: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template:~:text=Modify%20Katib%20controller%20ClusterRole%E2%80%99s%20rules%20with%20the%20new%20rule%20to%20give%20Katib%20access%20to%20all%20resources%20that%20are%20created%20by%20the%20Trial.%20To%20know%20more%20about%20ClusterRole%2C%20check%20the%20Kubernetes%20guide.

Katib needs to determine whether a given pod belongs to a Trial here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/inject_webhook.go#L286-L290

Now Katib can optimize HPs on TrainJobs 🎉
cc @kubeflow/kubeflow-trainer-team @kramaranya @szaher @astefanutti

$ k get trial -n $NS
NAME                                 TYPE        STATUS   AGE
torch-distributed-example-2pnvtn72   Running     True     22s
torch-distributed-example-4nqzdhvw   Running     True     21s
torch-distributed-example-9v79g85p   Succeeded   True     5m2s
torch-distributed-example-ftw6rv7f   Running     True     28s
torch-distributed-example-jq9jz7mt   Succeeded   True     5m2s
torch-distributed-example-phsqstg5   Succeeded   True     5m2s

$ k get trainjob -n $NS
NAME                                 STATE   AGE
torch-distributed-example-2pnvtn72           25s
torch-distributed-example-4nqzdhvw           24s
torch-distributed-example-ftw6rv7f           31s

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Aug 22, 2025
Member

@andreyvelich andreyvelich left a comment

Thanks for this great contribution, @ram4444!
/lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@ram4444
Contributor Author

ram4444 commented Aug 22, 2025

@andreyvelich
Thanks for your work. If there is anything I can help with, please let me know.

@andreyvelich
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 23, 2025
@google-oss-prow google-oss-prow bot merged commit c9528e7 into kubeflow:master Aug 23, 2025
80 of 81 checks passed
@juliusvonkohout
Member

Just ping me if you have a release to synchronize.

@kramaranya
Contributor

This looks great! This will be especially useful when we migrate Katib to Kubeflow SDK. Thank you for working on this!
