
DO NOT MERGE: An attempt at restarting the controller manager without restarting the whole operator binary #1117


Draft: wants to merge 6 commits into master

Conversation

@metlos (Contributor) commented Jan 7, 2025

This is not meant for merging.

It is an experiment in soft-restarting the controller manager within the host-operator process so that we don't have to rely on external tooling (i.e. ksctl) to manually restart the operator when the set of registered members changes. The operator itself could detect these occasions and "heal" itself by soft-restarting, so that further reconciliations work with the up-to-date set of members. In the longer term this could also give us the opportunity to consolidate the explicit set of member clusters established at operator start with the "cluster cache" used in the toolchaincluster controller and health checks.

Let's discuss here whether this potential approach has some merit and whether this solution is actually what we'd want.

At the moment, the restarts are simply triggered every minute. This is of course not how it would work for real - we would request a restart only when a new member is detected or when an existing one is removed.

The frequent restarts are in place so that we can judge the effect on memory consumption and spot any leaks that might be happening in the controller manager due to the way we restart it.
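
To make the mechanism concrete, here is a rough sketch of the kind of soft-restart loop this experiments with; runWithSoftRestarts, newManager and restartRequested are illustrative names only, not the actual code in this PR:

// Rough sketch only: the manager is rebuilt and restarted in a loop instead of
// the process exiting. newManager and restartRequested are illustrative names,
// not the actual API introduced in this PR.
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// newManager would build a fresh controller manager against the current set of
// member clusters and register all controllers on it.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
}

func runWithSoftRestarts(ctx context.Context, restartRequested <-chan struct{}) error {
	for {
		mgr, err := newManager()
		if err != nil {
			return err
		}

		// Each manager run gets its own cancellable child context so it can be
		// stopped without touching the process-wide root context.
		runCtx, cancel := context.WithCancel(ctx)
		errCh := make(chan error, 1)
		go func() { errCh <- mgr.Start(runCtx) }()

		select {
		case <-restartRequested:
			cancel()  // stop the current manager...
			<-errCh   // ...wait for it to shut down cleanly...
			continue  // ...and build a new one with the fresh member set
		case err := <-errCh:
			cancel()
			return err // the manager stopped on its own or the root context ended
		}
	}
}

The point is that only the manager (and its caches) is torn down and rebuilt; the process itself keeps running.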

I ran the experimental version of the host-operator in a cluster and continuously measured its memory consumption for 5 hours after creating 5000 user signups in the cluster. This means a non-trivial number of objects had to be loaded into the cache managed by the controller manager, so we should see the effects of any leaks if there were any.

So, in the experiment, the host-operator was reconciling 5000 user signups, restarting every minute, for 5 hours. The results are compiled in this spreadsheet: https://docs.google.com/spreadsheets/d/1maSd_KUUYBwA3U1d3UdSMPaQkEwXYmbhWg4duCBvfO0/edit?gid=0#gid=0.

The memory consumption seems quite stable to me:
[image: memory consumption graph]

…e whole operator binary

+ a very nasty way of seeing if it works - the controller manager should
restart every minute :)

openshift-ci bot commented Jan 7, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Jan 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metlos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


sonarqubecloud bot commented Jan 7, 2025

@MatousJobanek (Contributor) left a comment

Thanks a lot for this experimental PR, it looks very promising.

debug("doStart: about to invoke mgr.Start()")
if startErr := mgr.Start(s.runningContext); startErr != nil {
debug("doStart: mgr.Start() returned error. Sending it.")
err <- startErr
Contributor

if the manager cannot be started, then we can just simply force-kill, right?

Suggested change:
- err <- startErr
+ os.Exit(1)

There is nothing we could try to recover. Or is there any other cleanup we would need to do in this case?

Contributor

I have one more note/question - I understand the usage of the channel in this case, but I don't understand why we need to use the channel for all the other error cases returned from this method. Is it only to unify the error handling, or is there another reason?

Contributor Author

I am not a fan of panicking in "library" code.

As for the error handling - yes, the error is always returned in the channel so that the caller doesn't need special logic for when the "setup" of the start fails, as opposed to a failure that happens asynchronously during the startup and lifetime of the controller manager.
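
A made-up miniature of what that buys the caller (startManager here is a stand-in, not the real signature of the method discussed above):

// Made-up miniature of the "single error channel" idea: the caller does not
// care whether the failure happened during setup or later at runtime.
package sketch

import "context"

// startManager is a hypothetical stand-in for the Start method discussed here.
func startManager(ctx context.Context) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		// Setup errors and asynchronous runtime errors both end up in errCh.
		errCh <- setupAndRunManager(ctx)
	}()
	return errCh
}

func setupAndRunManager(ctx context.Context) error {
	// building the manager and calling mgr.Start(ctx) would live here
	return nil
}

func caller(ctx context.Context) error {
	// One code path for every failure, regardless of when it happened.
	return <-startManager(ctx)
}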


// now, let's see what caused the cancellation

if !errors.Is(context.Cause(s.runningContext), restartNeededError) {
Contributor

Checking the cause of the cancel function and comparing it is a bit weird to me - it reminds me of catching exceptions in Java. I'm not sure whether this is a best practice in Go, though.

Instead of using a "CancelCause" context you could use the standard "Cancel" context and add another field to the StartManager stating whether a restart was triggered (in the same way as you do it with the running field), and have another case for when the root context itself is cancelled:

		case <-ctx.Done():
			debug("Start: context cancelled, cancellation from above")
			<-finished
			debug("Start: quitting")

Contributor Author

errors.Is() is indeed a function defined in Go itself and is used quite extensively. What I like about context.Cause() is that, at least for my way of thinking, sending "messages" is more understandable than setting flags somewhere "on the side".

Also, I am not 100% certain about the Go memory model and what happens if you cancel the context in one thread and handle the cancellation in another - would the update to the flag be visible across threads?
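
For reference, a standalone miniature of the cancel-cause pattern under discussion; restartNeededError mirrors the sentinel used in the diff, the rest is illustrative. context.Cause is meant for exactly this cross-goroutine use - the cause is recorded before Done is closed, so a goroutine that has seen <-ctx.Done() can safely read it:

// Standalone miniature of the context.WithCancelCause pattern from the diff.
// restartNeededError mirrors the sentinel used there; everything else is
// illustrative.
package sketch

import (
	"context"
	"errors"
	"fmt"
)

var restartNeededError = errors.New("restart needed")

func run() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// Elsewhere (e.g. when a member cluster is added or removed) the restart
	// is requested by cancelling with the sentinel as the cause.
	go cancel(restartNeededError)

	<-ctx.Done()

	// The cause is recorded before Done is closed, so reading it here from
	// another goroutine is safe.
	if errors.Is(context.Cause(ctx), restartNeededError) {
		fmt.Println("soft restart requested")
	} else {
		fmt.Println("shutting down:", context.Cause(ctx))
	}
}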

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 81.65138% with 20 lines in your changes missing coverage. Please review.

Project coverage is 83.72%. Comparing base (ce0f8fc) to head (120aaac).

Files with missing lines   Patch %   Lines
pkg/restart/restart.go     81.65%    16 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1117      +/-   ##
==========================================
- Coverage   83.75%   83.72%   -0.03%     
==========================================
  Files          82       83       +1     
  Lines        7929     8038     +109     
==========================================
+ Hits         6641     6730      +89     
- Misses       1091     1107      +16     
- Partials      197      201       +4     
Files with missing lines   Coverage Δ
pkg/restart/restart.go     81.65% <81.65%> (ø)