death by workflow never reconciled
#13460
Replies: 6 comments
-
This is a huge problem; it happens to me even when the controller is only at 30% memory utilization.
-
I am currently investigating load issues, and I found that it's the workflow-never-reconciled problem again. So this is now related to the discussion "workflow controller keeps crashing under load". One question first: what does "workflow never reconciled" actually mean? What's the problem, and why is it bad?
I don't think there is a specific workflow recipe that triggers this problem. I just provoked it in my test cluster by starting 1000+ workflows, meaning this is a load problem. What needs to be done so that my Argo deployment can live up to this promise? (Running At Massive Scale)
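For illustration, a load like this can be provoked with something as simple as the loop below (a sketch; `whalesay` is a hypothetical WorkflowTemplate name, not necessarily what was used here):

```shell
# Hypothetical load test: submit 1000+ copies of a trivial workflow.
# "whalesay" is a placeholder WorkflowTemplate; replace it with your own.
for i in $(seq 1 1000); do
  argo submit -n argo --from workflowtemplate/whalesay
done
```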
What I tried so far:
I'm not even sure if this is Kubernetes API pressure. I can still execute kubectl commands. Some might take a couple of seconds (like listing all workflows), but I don't see how this could result in 5m delays. I'm at a loss here, because I don't know what to tweak to solve this problem. Nothing gets processed anymore, and all workflows eventually fail with "Step exceeded its deadline" ... just because I have ~1400 workflows in a backlog? This is basically what happens then:
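As an aside, a rough way to gauge the backlog size from the outside (a sketch, assuming the `argo` namespace and that `jq` is installed):

```shell
# Count all Workflow objects, and the ones the controller hasn't touched yet
# (no status at all, or still Pending). Assumes the "argo" namespace.
kubectl get wf -n argo --no-headers | wc -l
kubectl get wf -n argo -o json \
  | jq '[.items[] | select(.status == null or .status.phase == "Pending")] | length'
```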
-
The way I understand it (after going through the linked resources again): Argo didn't put a status on some of the workflows (yet), and because of that considers them "workflow never reconciled"? Or did I get that wrong? BTW: why is this logged as info when the consequences are quite fatal (the workflow controller crashes and literally stops working)?
Anyway, I set ... the worker pool seems to be less stressed ... and workflows are getting executed again. So far it looks like the default ...
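For anyone landing here: one commonly tuned knob in this area is the controller's `--workflow-workers` flag (default 32). It is not stated above which setting was actually changed; the patch below is only an illustrative sketch:

```shell
# Illustrative only: raise the controller's workflow worker count via the
# --workflow-workers flag (default 32). Assumes the controller is container 0
# of the "workflow-controller" deployment in the "argo" namespace.
kubectl -n argo patch deployment workflow-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--workflow-workers=64"}]'
```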
-
Just some thoughts: I didn't look into the code, because I wouldn't understand it anyway (I'm not a programmer) ... but I have a suspicion. From the outside (high-level perspective) it looks like a deadlock. I'm not saying it IS a deadlock, but it looks like one. FACTS/SYMPTOMS:
THEORY: If I look at the graphs (like the one with the busy workers, and my workflow monitoring), the way I read it:
-
Can someone (who understands the source code) please take a look at the workers? Why are they busy like that with 1000+ workflows in the backlog? What are they doing? Are they really blocked or something?
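Short of reading the code, worker saturation can also be inspected from the controller's Prometheus metrics (a sketch; 9090 is the default metrics port, and the exact metric names may differ between versions):

```shell
# Port-forward the controller's metrics endpoint and look for worker/queue metrics.
kubectl -n argo port-forward deploy/workflow-controller 9090:9090 &
curl -s http://localhost:9090/metrics | grep -i -E 'workers_busy|queue'
```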
-
The general story, as I understand it:
An unknown number of workflows got into a state that was reported as "workflow never reconciled". This led to the health check failing over and over again, which basically killed the workflow controller. Meanwhile, more and more workflows were being put into the backlog, but nothing got executed anymore. The ever-increasing number of pending/pre-pending workflows also wasn't helpful (I guess). The system died, probably at the moment the first workflows entered that problematic state.
This massive gap indicates that no workflows were executed during that time. In the UI I found countless workflows pending, waiting to be scheduled by the workflow controller, which at this point only panicked about an unknown number of "workflow never reconciled" warnings and totally forgot to care about the rest. Also, the problem seemed to start abruptly.
The number of restarts indicates that the workflow controller was automatically restarted every ~6 minutes.
```
$ kubectl describe pod -n argo workflow-controller-76ffbdcf8f-57p67
[...]
    Image:          quay.io/argoproj/workflow-controller:v3.5.8
[...]
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 12 Aug 2024 16:32:23 +0000
      Finished:     Mon, 12 Aug 2024 16:36:23 +0000
    Ready:          False
    Restart Count:  1190
    Liveness:       http-get http://:6060/healthz delay=90s timeout=30s period=60s #success=1 #failure=3
[...]
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
[...]
  Warning  Unhealthy  5m46s (x3575 over 6d10h)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
[...]
```
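For what it's worth, the same endpoint the liveness probe uses can be queried by hand while a controller pod is up (a sketch, using port 6060 from the probe configuration above):

```shell
# Query the controller's health endpoint directly to see the 500 response and
# its body. Only works while the (crash-looping) pod is currently running.
kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &
curl -i http://localhost:6060/healthz
```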
From the logs I can assume that at least 3 workflows were in that never-reconciled state; the actual number is unknown.
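A sketch of how to spot the affected workflows in the logs (the exact log wording may vary by version; `--previous` also covers the last crashed container instance):

```shell
# Search the controller logs for the never-reconciled messages.
kubectl -n argo logs deploy/workflow-controller | grep -i "never reconciled"
kubectl -n argo logs deploy/workflow-controller --previous | grep -i "never reconciled"
```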
I have no idea what the root cause is. It took me over 4 hours to clean this up by hand (I had to manually delete >30k workflows from my system). After that was finished, the system recovered. I find it concerning, though, that an invalid workflow state can bring the entire system down.
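For the record, in a test cluster where everything is expendable, the cleanup can also be done in bulk rather than workflow by workflow (a sketch):

```shell
# Delete all Workflow objects in the argo namespace. Only do this when every
# workflow is expendable, e.g. in a test environment.
kubectl -n argo delete workflows --all --wait=false
```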
(1) Does anyone know how this state ("workflow never reconciled") comes to be?
(2) Does anyone know what I could do to prevent or stop a cascade like this, so that the workflow system stays alive or can recover automatically?
(3) How can it be that this "workflow never reconciled" problem completely breaks the workflow controller and renders it useless? I mean, it's just a bunch of invalid workflow states, right? Why can't the workflow controller just ignore them for normal operation and keep going with the rest that isn't broken? Of course, the workflow controller should report them. But is that a good enough reason to stop the world?
Thankfully, this only happened in a testing environment. BUT. I have to say, my confidence in Argo Workflows really took a hit here. If this happens in production, I'll have some explaining to do, so I need a little help here.