death by workflow never reconciled
#13460
Replies: 6 comments
-
This is a huge problem; it happens to me even when the controller is only at 30% memory utilization.
-
I am currently investigating load issues, and I found that it's the workflow-never-reconciled problem again. So this is now related to the discussion "workflow controller keeps crashing under load". One question first: what does "workflow never reconciled" actually mean? What's the problem, and why is it bad?
I don't think there is a specific workflow recipe that triggers this problem. I just provoked it in my test cluster by starting 1000+ workflows, meaning this is a load problem. What needs to be done so that my Argo deployment can live up to this promise? (Running At Massive Scale)
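For illustration, a load like this can be provoked with something as simple as the loop below (a sketch; `whalesay` is a hypothetical WorkflowTemplate name, not necessarily what was used here):

```shell
# Hypothetical load test: submit 1000+ copies of a trivial workflow.
# "whalesay" is a placeholder WorkflowTemplate; replace it with your own.
for i in $(seq 1 1000); do
  argo submit -n argo --from workflowtemplate/whalesay
done
```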
What I tried so far:
I'm not even sure if this is Kubernetes API pressure. I can still execute kubectl commands. Some might take a couple of seconds (like listing all workflows), but I don't see how this could result in 5m delays. I'm at a loss here, because I don't know what to tweak to solve this problem. Nothing gets processed anymore, and all workflows eventually fail with "Step exceeded its deadline" ... just because I have ~1400 workflows in a backlog? This is basically what happens then:
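As an aside, a rough way to gauge the backlog size from the outside (a sketch, assuming the `argo` namespace and that `jq` is installed):

```shell
# Count all Workflow objects, and the ones the controller hasn't touched yet
# (no status at all, or still Pending). Assumes the "argo" namespace.
kubectl get wf -n argo --no-headers | wc -l
kubectl get wf -n argo -o json \
  | jq '[.items[] | select(.status == null or .status.phase == "Pending")] | length'
```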
-
The way I understand it (after going through the linked resources again): Argo didn't put a status on some of the workflows (yet), and because of that considers them "workflow never reconciled"? Or did I get that wrong? BTW: why is this logged as info when the consequences are quite fatal (the workflow controller crashes and literally stops working)?
Anyway, I set ... the worker pool seems to be less stressed ... and workflows are getting executed again. So far it looks like the default ...
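For anyone landing here: one commonly tuned knob in this area is the controller's `--workflow-workers` flag (default 32). It is not stated above which setting was actually changed; the patch below is only an illustrative sketch:

```shell
# Illustrative only: raise the controller's workflow worker count via the
# --workflow-workers flag (default 32). Assumes the controller is container 0
# of the "workflow-controller" deployment in the "argo" namespace.
kubectl -n argo patch deployment workflow-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--workflow-workers=64"}]'
```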
-
Just some thoughts: I didn't look into the code, because I wouldn't understand it anyway (I'm not a programmer) ... but I have a suspicion. From the outside (high-level perspective) it looks like a deadlock. I'm not saying it IS a deadlock, but it looks like one. FACTS/SYMPTOMS:
THEORY: If I look at the graphs (like the one with the busy workers, and my workflow monitoring), the way I read it:
-
Can someone (who understands the source code) please take a look at the workers? Why are they busy like that with 1000+ workflows in the backlog? What are they doing? Are they really blocked or something?
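Short of reading the code, worker saturation can also be inspected from the controller's Prometheus metrics (a sketch; 9090 is the default metrics port, and the exact metric names may differ between versions):

```shell
# Port-forward the controller's metrics endpoint and look for worker/queue metrics.
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &
curl -s http://localhost:9090/metrics | grep -i -E 'workers_busy|queue'
```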
-
The general story, as I understand it:
An unknown number of workflows got into a state that was reported as "workflow never reconciled". This led to the health check failing over and over again, which basically killed the workflow controller. Meanwhile, more and more workflows were being put into the backlog, but nothing got executed anymore. The ever-increasing number of pending/pre-pending workflows also wasn't helpful (I guess). The system died, probably at the moment the first workflows entered that problematic state.
This massive gap indicates that no workflows were executed during that time. In the UI I found countless workflows pending, waiting to be scheduled by the workflow controller, which at this point only panicked about an unknown number of "workflow never reconciled" warnings and totally forgot to care about the rest. Also, the problem seemed to start abruptly.
The number of restarts indicates that the workflow controller was automatically restarted every ~6 minutes.
```
$ kubectl describe pod -n argo workflow-controller-76ffbdcf8f-57p67
[...]
    Image:          quay.io/argoproj/workflow-controller:v3.5.8
[...]
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 12 Aug 2024 16:32:23 +0000
      Finished:     Mon, 12 Aug 2024 16:36:23 +0000
    Ready:          False
    Restart Count:  1190
    Liveness:       http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
[...]
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
[...]
  Warning  Unhealthy  5m46s (x3575 over 6d10h)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
[...]
```
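For what it's worth, the same endpoint the liveness probe uses can be queried by hand while a controller pod is up (a sketch, using port 6060 from the probe configuration above):

```shell
# Query the controller's health endpoint directly to see the 500 response and
# its body. Only works while the (crash-looping) pod is currently running.
kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &
curl -i http://localhost:6060/healthz
```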
From the logs I can assume that at least 3 workflows were in that never-reconciled state; the actual number is unknown.
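A sketch of how to spot the affected workflows in the logs (the exact log wording may vary by version; `--previous` also covers the last crashed container instance):

```shell
# Search the controller logs for the never-reconciled messages.
kubectl -n argo logs deploy/workflow-controller | grep -i "never reconciled"
kubectl -n argo logs deploy/workflow-controller --previous | grep -i "never reconciled"
```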
I have no idea what the root cause is. It took me over 4 hours to clean this up by hand (I had to manually delete >30k workflows from my system). After that was finished, the system recovered. I find it concerning, though, that an invalid workflow state can bring the entire system down.
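For the record, in a test cluster where everything is expendable, the cleanup can also be done in bulk rather than workflow by workflow (a sketch):

```shell
# Delete all Workflow objects in the argo namespace. Only do this when every
# workflow is expendable, e.g. in a test environment.
kubectl -n argo delete workflows --all --wait=false
```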
(1) Does anyone know how this state ("workflow never reconciled") comes to be?
(2) Does anyone know what I could do to prevent or stop a cascade like this, so that the workflow system stays alive or can recover automatically?
(3) How can it be that this "workflow never reconciled" problem completely breaks the workflow controller and renders it useless? I mean, it's just a bunch of invalid workflow states, right? Why can't the workflow controller just ignore them for normal operation and keep going with the rest that isn't broken? Of course, the workflow controller should report them. But is that a good enough reason to stop the world?
Thankfully, this only happened in a testing environment. BUT. I have to say, my confidence in Argo Workflows really took a hit here. If this happens in production, I'll have some explaining to do, so I need a little help here.