Skip to content

Conversation

weyfonk
Copy link
Contributor

@weyfonk weyfonk commented Sep 26, 2025

This propagates a number of errors, which can happen in the process of reconciling a resource, to that resource's status:

  • GitRepo: failure to create a git job
  • Bundle: failures happening when:
    • resolving a Helm chart version
    • handling OCI storage manifests
    • targeting the bundle
    • writing Content resources
    • managing values secrets, whether writing or reading them
    • adding a finalizer
    • computing rollout partitions
    • cloning OCI storage or Helm access secrets

Refers to #3617.
This patch is a preliminary step to making the changes referenced in that issue.

Most of its lines come from new unit test cases used to validate error status updates in the cases mentioned above.

A new file in the reconciler package, error_handling.go, contains extracted logic to be reused across reconcilers for easier and more consistent error reporting.

Additional Information

Checklist

  • I have updated the documentation via a pull request in the
    fleet-docs repository.

@weyfonk weyfonk requested a review from a team as a code owner September 26, 2025 07:50
@weyfonk weyfonk changed the title Improve bundle status error reporting Improve bundle and GitRepo status error reporting Sep 26, 2025
@weyfonk weyfonk marked this pull request as draft September 26, 2025 08:10
@weyfonk weyfonk marked this pull request as ready for review September 26, 2025 13:28
@weyfonk weyfonk added this to Fleet Sep 26, 2025
@weyfonk weyfonk moved this to 👀 In review in Fleet Sep 26, 2025
Copy link
Contributor

@0xavi0 0xavi0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a few errors that are maybe too internal to be propagated to the user (like the finalizer one) and we might be missing the retry mechanism from controller-runtime.

I added comments to a few (not all), but basically I think any error coming from an API server call could be a potential candidate to be requeued.

@weyfonk weyfonk force-pushed the improve-status-error-reporting branch from c2a03ae to 3c3c617 Compare October 6, 2025 11:42
@weyfonk weyfonk requested a review from 0xavi0 October 6, 2025 11:45
@weyfonk weyfonk force-pushed the improve-status-error-reporting branch from 3c3c617 to cea457b Compare October 7, 2025 10:45
Copy link
Contributor

@0xavi0 0xavi0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we could standardise how we deal with errors across controllers.

I like that the Bundle Controller returns a TerminalError when the error should not be Retryable, we could maybe use that also in the Gitops controller.

We could do that in a separated PR if we need to merge this with any kind of urgency, just thought that it was a great chance for standardising things

Comment on lines 292 to 297
if errors.Is(err, fleetutil.ErrRetryable) {
logger.Info(err.Error())
return ctrl.Result{RequeueAfter: durations.DefaultRequeueAfter}, nil
}

return ctrl.Result{}, r.updateErrorStatus(ctx, req.NamespacedName, bundleOrig, bundle, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file is using this block more than once.

Would it make sense to make it a function?
And.. I think it would make sense to make it a function accessible from all controllers so we can use it as the standard for errors that might (or not) be shown to the user via conditions or retried.

That way, when we detect an error in any controller we do the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great idea, see the new reconciler/error_handling.go :)

// Setting that condition makes the error message visible in the Rancher UI.
func (r *BundleReconciler) updateErrorStatus(
ctx context.Context,
req types.NamespacedName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why the linter does not complain about req.
It's not used in the function... is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also .... same remark about the other block checking if it's retrayable, etc...
We might want to add this function to a shared package accesible for all controllers.

I see we have the same in the gitops controller using the Accepted condition instead.
Maybe we can have the condition as a parameter?

Would be cool to have like a standard way for dealing with errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, parameter removed, thanks!
I've added a new function to set an error on a parametrised condition, see reconciler/error_handling.go.

@weyfonk weyfonk force-pushed the improve-status-error-reporting branch from cea457b to 99d35c3 Compare October 9, 2025 15:25
@weyfonk weyfonk force-pushed the improve-status-error-reporting branch from 99d35c3 to 0164447 Compare October 13, 2025 14:17
@kkaempf kkaempf added this to the v2.13.0 milestone Oct 13, 2025
When a git job cannot be created, for instance because a Helm secret
cannot be found, Fleet now propagates the error to the GitRepo's status
message, meaning that looking through gitOps controller logs is no
longer necessary in such cases.
Reconciler methods now live together, all placed before functions.
The bundle reconciler must return a failure when the chart version set
on a bundle is a non-strict version, as that is a symptom of a failure,
or a needed delay, on the HelmOp controller's end to do its job,
including resolving version constraints to static versions.
As unlikely as they may seem, failures to process a bundle resources'
manifest for OCI storage should also lead to the corresponding error
being reported in the bundle's status, for easier troubleshooting.
In such cases, the error is propagated to the bundle status, as it
already was. The novelty here is that the reconciler returns that error.
When the bundle reconciler fails to create a Content resource with a
bundle resource's manifest, it now propagates that error to the bundle's
status, instead of just returning a reconciler error.
When the bundle reconciler fails to create, update or delete a Helm
values secret for a bundle, or to extract Helm values from the bundle as
a preliminary step to creating such a secret, the corresponding error is
now propagated to the bundle status to ease troubleshooting.
When the bundle reconciler fails to load a bundle's Helm values secret,
it now propagates the error to the bundle's status to ease
troubleshooting.
In case the bundle reconciler is unable to add a finalizer to a new
bundle, the reconciler propagates the corresponding error to the bundle
status, for easier troubleshooting.
When the bundle reconciler is unable to retrieve an access secret for
OCI storage, it now propagates the corresponding error to the bundle
status for easier troubleshooting.
When the bundle reconciler fails to compute partitions or update the
bundle status based on those partitions, it propagates the corresponding
error to the bundle status, for easier troubleshooting.
When the bundle reconciler fails to clone an access secret, it now
propagates the corresponding error to the bundle status, for easier
troubleshooting.
Individual test cases are now slightly more concise.
Failures to add a finalizer to a bundle are internal to Fleet, and
should be temporary. Propagating them to the bundle's status would not
help the user much.
Transient errors, related to interaction with the API server when storing
bundle deployment manifests, are now logged instead of being propagated
to bundle statuses. Errors which cannot be resolved through a retry,
though, are still propagated.
The secret involved in that logic is the one containing the bundle's
contents. It is not the access secret for the OCI registry.
When the bundle reconciler fails to create, update or delete an options
secret, that error comes from the Kubernetes API server, and therefore
leads to retries, without the user being able to (nor needing to) do
anything to mitigate the error.
Therefore, Fleet simply retries in such cases, and logs the error
instead of propagating it to the bundle status.
Fleet will requeue and log retryable errors, but propagate non-retriable
ones to the bundle's status.
Possible errors, except for controller-runtime framework internals, are
all retryable and therefore do not lead to bundle status updates, but
rather to logs and requeues.
When the bundle reconciler fails with a non-retryable error, it now
returns a nil error to prevent requeues.
A non-retryable error in a bundle reconcile attempt now leads to a
`TerminalError` being returned by the reconciler, which prevents retries
[1] while being more intuitive than returning nil errors.

[1]: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#TypedReconciler
The bundle reconciler now uses a new method to distinguish errors
between retryable and non-retryable ones, and to compute reconcile
results and errors accordingly, including bundle status updates.
`req` was never actually used in the bundle reconciler's
`updateErrorStatus`.
Reconcilers need to be able to report errors following common patterns,
including:
* detecting whether an error is retryable, and producing a reconcile
  result and error if so
* populating an error in a condition within a resource status

This commit implements such common logic, to be expanded (and used
across multiple reconcilers) as more patterns emerge.
Setting a successful condition can also be done using that logic.
@weyfonk weyfonk force-pushed the improve-status-error-reporting branch from 0164447 to 62a3c22 Compare October 13, 2025 15:09
@weyfonk weyfonk merged commit dbfd59a into rancher:main Oct 14, 2025
35 of 36 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Fleet Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants