-
Notifications
You must be signed in to change notification settings - Fork 247
Improve bundle and GitRepo status error reporting #4167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve bundle and GitRepo status error reporting #4167
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a few errors that are maybe too internal to be propagated to the user (like the finalizer one) and we might be missing the retry mechanism from controller-runtime.
I added comments to a few (not all), but basically I think any error coming from an API server call could be a potential candidate to be requeued.
c2a03ae
to
3c3c617
Compare
3c3c617
to
cea457b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we could standardise how we deal with errors across controllers.
I like that the Bundle Controller returns a TerminalError when the error should not be Retryable, we could maybe use that also in the Gitops controller.
We could do that in a separated PR if we need to merge this with any kind of urgency, just thought that it was a great chance for standardising things
if errors.Is(err, fleetutil.ErrRetryable) { | ||
logger.Info(err.Error()) | ||
return ctrl.Result{RequeueAfter: durations.DefaultRequeueAfter}, nil | ||
} | ||
|
||
return ctrl.Result{}, r.updateErrorStatus(ctx, req.NamespacedName, bundleOrig, bundle, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this file is using this block more than once.
Would it make sense to make it a function?
And.. I think it would make sense to make it a function accessible from all controllers so we can use it as the standard for errors that might (or not) be shown to the user via conditions or retried.
That way, when we detect an error in any controller we do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great idea, see the new reconciler/error_handling.go
:)
// Setting that condition makes the error message visible in the Rancher UI. | ||
func (r *BundleReconciler) updateErrorStatus( | ||
ctx context.Context, | ||
req types.NamespacedName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why the linter does not complain about req
.
It's not used in the function... is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also .... same remark about the other block checking if it's retrayable, etc...
We might want to add this function to a shared package accesible for all controllers.
I see we have the same in the gitops controller using the Accepted
condition instead.
Maybe we can have the condition as a parameter?
Would be cool to have like a standard way for dealing with errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, parameter removed, thanks!
I've added a new function to set an error on a parametrised condition, see reconciler/error_handling.go
.
cea457b
to
99d35c3
Compare
99d35c3
to
0164447
Compare
When a git job cannot be created, for instance because a Helm secret cannot be found, Fleet now propagates the error to the GitRepo's status message, meaning that looking through gitOps controller logs is no longer necessary in such cases.
Reconciler methods now live together, all placed before functions.
The bundle reconciler must return a failure when the chart version set on a bundle is a non-strict version, as that is a symptom of a failure, or a needed delay, on the HelmOp controller's end to do its job, including resolving version constraints to static versions.
As unlikely as they may seem, failures to process a bundle resources' manifest for OCI storage should also lead to the corresponding error being reported in the bundle's status, for easier troubleshooting.
In such cases, the error is propagated to the bundle status, as it already was. The novelty here is that the reconciler returns that error.
When the bundle reconciler fails to create a Content resource with a bundle resource's manifest, it now propagates that error to the bundle's status, instead of just returning a reconciler error.
When the bundle reconciler fails to create, update or delete a Helm values secret for a bundle, or to extract Helm values from the bundle as a preliminary step to creating such a secret, the corresponding error is now propagated to the bundle status to ease troubleshooting.
When the bundle reconciler fails to load a bundle's Helm values secret, it now propagates the error to the bundle's status to ease troubleshooting.
In case the bundle reconciler is unable to add a finalizer to a new bundle, the reconciler propagates the corresponding error to the bundle status, for easier troubleshooting.
When the bundle reconciler is unable to retrieve an access secret for OCI storage, it now propagates the corresponding error to the bundle status for easier troubleshooting.
When the bundle reconciler fails to compute partitions or update the bundle status based on those partitions, it propagates the corresponding error to the bundle status, for easier troubleshooting.
When the bundle reconciler fails to clone an access secret, it now propagates the corresponding error to the bundle status, for easier troubleshooting.
Individual test cases are now slightly more concise.
Failures to add a finalizer to a bundle are internal to Fleet, and should be temporary. Propagating them to the bundle's status would not help the user much.
Transient errors, related to interaction with the API server when storing bundle deployment manifests, are now logged instead of being propagated to bundle statuses. Errors which cannot be resolved through a retry, though, are still propagated.
The secret involved in that logic is the one containing the bundle's contents. It is not the access secret for the OCI registry.
When the bundle reconciler fails to create, update or delete an options secret, that error comes from the Kubernetes API server, and therefore leads to retries, without the user being able to (nor needing to) do anything to mitigate the error. Therefore, Fleet simply retries in such cases, and logs the error instead of propagating it to the bundle status.
Fleet will requeue and log retryable errors, but propagate non-retriable ones to the bundle's status.
Possible errors, except for controller-runtime framework internals, are all retryable and therefore do not lead to bundle status updates, but rather to logs and requeues.
When the bundle reconciler fails with a non-retryable error, it now returns a nil error to prevent requeues.
A non-retryable error in a bundle reconcile attempt now leads to a `TerminalError` being returned by the reconciler, which prevents retries [1] while being more intuitive than returning nil errors. [1]: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#TypedReconciler
The bundle reconciler now uses a new method to distinguish errors between retryable and non-retryable ones, and to compute reconcile results and errors accordingly, including bundle status updates.
`req` was never actually used in the bundle reconciler's `updateErrorStatus`.
Reconcilers need to be able to report errors following common patterns, including: * detecting whether an error is retryable, and producing a reconcile result and error if so * populating an error in a condition within a resource status This commit implements such common logic, to be expanded (and used across multiple reconcilers) as more patterns emerge.
Setting a successful condition can also be done using that logic.
0164447
to
62a3c22
Compare
This propagates a number of errors, which can happen in the process of reconciling a resource, to that resource's status:
Refers to #3617.
This patch is a preliminary step to making the changes referenced in that issue.
Most of its lines come from new unit test cases used to validate error status updates in the cases mentioned above.
A new file in the
reconciler
package,error_handling.go
, contains extracted logic to be reused across reconcilers for easier and more consistent error reporting.Additional Information
Checklist
fleet-docs repository.