Improve bundle and GitRepo status error reporting #4167

weyfonk · 2025-09-26T07:50:35Z

This propagates a number of errors, which can happen in the process of reconciling a resource, to that resource's status:

GitRepo: failure to create a git job
Bundle: failures happening when:
- resolving a Helm chart version
- handling OCI storage manifests
- targeting the bundle
- writing Content resources
- managing values secrets, whether writing or reading them
- adding a finalizer
- computing rollout partitions
- cloning OCI storage or Helm access secrets

Refers to #3617.
This patch is a preliminary step to making the changes referenced in that issue.

Most of its lines come from new unit test cases used to validate error status updates in the cases mentioned above.

A new file in the reconciler package, error_handling.go, contains extracted logic to be reused across reconcilers for easier and more consistent error reporting.

Additional Information

Checklist

I have updated the documentation via a pull request in the
fleet-docs repository.

0xavi0

I think there are a few errors that are maybe too internal to be propagated to the user (like the finalizer one) and we might be missing the retry mechanism from controller-runtime.

I added comments to a few (not all), but basically I think any error coming from an API server call could be a potential candidate to be requeued.

internal/cmd/controller/reconciler/bundle_controller.go

0xavi0

Wondering if we could standardise how we deal with errors across controllers.

I like that the Bundle Controller returns a TerminalError when the error should not be Retryable, we could maybe use that also in the Gitops controller.

We could do that in a separated PR if we need to merge this with any kind of urgency, just thought that it was a great chance for standardising things

0xavi0 · 2025-10-09T12:24:32Z

internal/cmd/controller/reconciler/bundle_controller.go

+			if errors.Is(err, fleetutil.ErrRetryable) {
+				logger.Info(err.Error())
+				return ctrl.Result{RequeueAfter: durations.DefaultRequeueAfter}, nil
+			}
+
+			return ctrl.Result{}, r.updateErrorStatus(ctx, req.NamespacedName, bundleOrig, bundle, err)


I think this file is using this block more than once.

Would it make sense to make it a function?
And.. I think it would make sense to make it a function accessible from all controllers so we can use it as the standard for errors that might (or not) be shown to the user via conditions or retried.

That way, when we detect an error in any controller we do the same.

Yes, great idea, see the new reconciler/error_handling.go :)

0xavi0 · 2025-10-09T12:25:24Z

internal/cmd/controller/reconciler/bundle_controller.go

+// Setting that condition makes the error message visible in the Rancher UI.
+func (r *BundleReconciler) updateErrorStatus(
+	ctx context.Context,
+	req types.NamespacedName,


I wonder why the linter does not complain about req.
It's not used in the function... is it?

Also .... same remark about the other block checking if it's retrayable, etc...
We might want to add this function to a shared package accesible for all controllers.

I see we have the same in the gitops controller using the Accepted condition instead.
Maybe we can have the condition as a parameter?

Would be cool to have like a standard way for dealing with errors.

Nice catch, parameter removed, thanks!
I've added a new function to set an error on a parametrised condition, see reconciler/error_handling.go.

internal/cmd/controller/gitops/reconciler/gitjob_controller.go

When a git job cannot be created, for instance because a Helm secret cannot be found, Fleet now propagates the error to the GitRepo's status message, meaning that looking through gitOps controller logs is no longer necessary in such cases.

Reconciler methods now live together, all placed before functions.

The bundle reconciler must return a failure when the chart version set on a bundle is a non-strict version, as that is a symptom of a failure, or a needed delay, on the HelmOp controller's end to do its job, including resolving version constraints to static versions.

As unlikely as they may seem, failures to process a bundle resources' manifest for OCI storage should also lead to the corresponding error being reported in the bundle's status, for easier troubleshooting.

In such cases, the error is propagated to the bundle status, as it already was. The novelty here is that the reconciler returns that error.

When the bundle reconciler fails to create a Content resource with a bundle resource's manifest, it now propagates that error to the bundle's status, instead of just returning a reconciler error.

When the bundle reconciler fails to create, update or delete a Helm values secret for a bundle, or to extract Helm values from the bundle as a preliminary step to creating such a secret, the corresponding error is now propagated to the bundle status to ease troubleshooting.

When the bundle reconciler fails to load a bundle's Helm values secret, it now propagates the error to the bundle's status to ease troubleshooting.

In case the bundle reconciler is unable to add a finalizer to a new bundle, the reconciler propagates the corresponding error to the bundle status, for easier troubleshooting.

When the bundle reconciler is unable to retrieve an access secret for OCI storage, it now propagates the corresponding error to the bundle status for easier troubleshooting.

When the bundle reconciler fails to compute partitions or update the bundle status based on those partitions, it propagates the corresponding error to the bundle status, for easier troubleshooting.

When the bundle reconciler fails to clone an access secret, it now propagates the corresponding error to the bundle status, for easier troubleshooting.

Individual test cases are now slightly more concise.

Failures to add a finalizer to a bundle are internal to Fleet, and should be temporary. Propagating them to the bundle's status would not help the user much.

Transient errors, related to interaction with the API server when storing bundle deployment manifests, are now logged instead of being propagated to bundle statuses. Errors which cannot be resolved through a retry, though, are still propagated.

The secret involved in that logic is the one containing the bundle's contents. It is not the access secret for the OCI registry.

When the bundle reconciler fails to create, update or delete an options secret, that error comes from the Kubernetes API server, and therefore leads to retries, without the user being able to (nor needing to) do anything to mitigate the error. Therefore, Fleet simply retries in such cases, and logs the error instead of propagating it to the bundle status.

Fleet will requeue and log retryable errors, but propagate non-retriable ones to the bundle's status.

Possible errors, except for controller-runtime framework internals, are all retryable and therefore do not lead to bundle status updates, but rather to logs and requeues.

When the bundle reconciler fails with a non-retryable error, it now returns a nil error to prevent requeues.

A non-retryable error in a bundle reconcile attempt now leads to a `TerminalError` being returned by the reconciler, which prevents retries [1] while being more intuitive than returning nil errors. [1]: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#TypedReconciler

The bundle reconciler now uses a new method to distinguish errors between retryable and non-retryable ones, and to compute reconcile results and errors accordingly, including bundle status updates.

`req` was never actually used in the bundle reconciler's `updateErrorStatus`.

Reconcilers need to be able to report errors following common patterns, including: * detecting whether an error is retryable, and producing a reconcile result and error if so * populating an error in a condition within a resource status This commit implements such common logic, to be expanded (and used across multiple reconcilers) as more patterns emerge.

Setting a successful condition can also be done using that logic.

weyfonk requested a review from a team as a code owner September 26, 2025 07:50

weyfonk changed the title ~~Improve bundle status error reporting~~ Improve bundle and GitRepo status error reporting Sep 26, 2025

weyfonk marked this pull request as draft September 26, 2025 08:10

weyfonk mentioned this pull request Sep 26, 2025

Prevent bundle reconciles triggered by bundle status updates #4171

Merged

weyfonk marked this pull request as ready for review September 26, 2025 13:28

weyfonk added this to Fleet Sep 26, 2025

weyfonk moved this to 👀 In review in Fleet Sep 26, 2025

0xavi0 reviewed Oct 2, 2025

View reviewed changes

weyfonk force-pushed the improve-status-error-reporting branch from c2a03ae to 3c3c617 Compare October 6, 2025 11:42

weyfonk requested a review from 0xavi0 October 6, 2025 11:45

weyfonk force-pushed the improve-status-error-reporting branch from 3c3c617 to cea457b Compare October 7, 2025 10:45

kkaempf assigned weyfonk Oct 9, 2025

0xavi0 reviewed Oct 9, 2025

View reviewed changes

weyfonk force-pushed the improve-status-error-reporting branch from cea457b to 99d35c3 Compare October 9, 2025 15:25

weyfonk mentioned this pull request Oct 10, 2025

Improve error reporting in GitRepo status #4211

Open

2 tasks

0xavi0 approved these changes Oct 10, 2025

View reviewed changes

weyfonk force-pushed the improve-status-error-reporting branch from 99d35c3 to 0164447 Compare October 13, 2025 14:17

kkaempf added this to the v2.13.0 milestone Oct 13, 2025

weyfonk added 12 commits October 13, 2025 17:09

Reorder bundle controller methods and functions

ee3d0a4

Reconciler methods now live together, all placed before functions.

Update bundle status with OCI storage manifest handling errors

38b9f31

As unlikely as they may seem, failures to process a bundle resources' manifest for OCI storage should also lead to the corresponding error being reported in the bundle's status, for easier troubleshooting.

Fail bundle reconcile when bundle targeting fails

4aa5bb1

In such cases, the error is propagated to the bundle status, as it already was. The novelty here is that the reconciler returns that error.

Propagate failures to write Content resources

fbbe88b

When the bundle reconciler fails to create a Content resource with a bundle resource's manifest, it now propagates that error to the bundle's status, instead of just returning a reconciler error.

Propagate values secret load error to bundle status

232380c

When the bundle reconciler fails to load a bundle's Helm values secret, it now propagates the error to the bundle's status to ease troubleshooting.

Propagate failure to add bundle finalizer to bundle status

eca8369

In case the bundle reconciler is unable to add a finalizer to a new bundle, the reconciler propagates the corresponding error to the bundle status, for easier troubleshooting.

Propagate OCI storage access secret error to bundle status

be3ecd9

When the bundle reconciler is unable to retrieve an access secret for OCI storage, it now propagates the corresponding error to the bundle status for easier troubleshooting.

Propagate partitions handling errors to bundle status

4288e01

When the bundle reconciler fails to compute partitions or update the bundle status based on those partitions, it propagates the corresponding error to the bundle status, for easier troubleshooting.

Propagate access secrets cloning errors to bundle status

b27cfeb

When the bundle reconciler fails to clone an access secret, it now propagates the corresponding error to the bundle status, for easier troubleshooting.

weyfonk added 13 commits October 13, 2025 17:09

Isolate bundle status patch checks into separate function

ee5be28

Individual test cases are now slightly more concise.

Skip status update for finalizer errors

9b5694c

Failures to add a finalizer to a bundle are internal to Fleet, and should be temporary. Propagating them to the bundle's status would not help the user much.

Fix error message for OCI reference build failure

c1728d6

The secret involved in that logic is the one containing the bundle's contents. It is not the access secret for the OCI registry.

Distinguish retryable errors when resolving OCI reference

643e6ae

Fleet will requeue and log retryable errors, but propagate non-retriable ones to the bundle's status.

Distinguish retriable errors when cloning secrets

e00432c

Possible errors, except for controller-runtime framework internals, are all retryable and therefore do not lead to bundle status updates, but rather to logs and requeues.

Prevent bundle reconcile retries in non-retryable error cases

1ef2c37

When the bundle reconciler fails with a non-retryable error, it now returns a nil error to prevent requeues.

Factor retryable error checks into separate function

829bcce

The bundle reconciler now uses a new method to distinguish errors between retryable and non-retryable ones, and to compute reconcile results and errors accordingly, including bundle status updates.

Remove unused parameter

ab6889e

`req` was never actually used in the bundle reconciler's `updateErrorStatus`.

Apply new common logic to non-error condition setting

62a3c22

Setting a successful condition can also be done using that logic.

weyfonk force-pushed the improve-status-error-reporting branch from 0164447 to 62a3c22 Compare October 13, 2025 15:09

weyfonk merged commit dbfd59a into rancher:main Oct 14, 2025
35 of 36 checks passed

github-project-automation bot moved this from 👀 In review to ✅ Done in Fleet Oct 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve bundle and GitRepo status error reporting #4167

Improve bundle and GitRepo status error reporting #4167

Uh oh!

weyfonk commented Sep 26, 2025 •

edited

Loading

Uh oh!

0xavi0 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

0xavi0 left a comment

Uh oh!

0xavi0 Oct 9, 2025

Uh oh!

weyfonk Oct 9, 2025

Uh oh!

0xavi0 Oct 9, 2025

Uh oh!

0xavi0 Oct 9, 2025

Uh oh!

weyfonk Oct 9, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Improve bundle and GitRepo status error reporting #4167

Improve bundle and GitRepo status error reporting #4167

Uh oh!

Conversation

weyfonk commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Additional Information

Checklist

Uh oh!

0xavi0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

0xavi0 left a comment

Choose a reason for hiding this comment

Uh oh!

0xavi0 Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

weyfonk Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

0xavi0 Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

0xavi0 Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

weyfonk Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

weyfonk commented Sep 26, 2025 •

edited

Loading