-
I think your main container (runc) was killed by the OOM killer. Check your running pod.
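One way to check, as a minimal sketch: dump the pod with `kubectl get pod <pod-name> -o json > pod.json` and scan the container statuses for a termination with reason `OOMKilled`. The script name `check_oom.py`, the function `find_oom_killed`, and the `pod.json` path are made up for illustration; the fields it reads (`status.containerStatuses[].state` and `lastState`) are the standard Kubernetes pod status fields.

```python
import json
import sys


def find_oom_killed(pod):
    """Return names of containers whose current or last state is a
    termination with reason 'OOMKilled'."""
    hits = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        # 'state' is the current state, 'lastState' the previous one
        # (populated after a restart).
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                hits.append(cs.get("name", "?"))
                break  # report each container at most once
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_oom.py pod.json
    with open(sys.argv[1]) as fh:
        pod = json.load(fh)
    names = find_oom_killed(pod)
    print("OOMKilled containers:", ", ".join(names) if names else "none")
```

If this reports nothing, that does not necessarily rule out an OOM kill: a container that is still shown as Running will not carry a terminated state, so the node's kernel log may be the better place to look.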
-
I have a problem which I initially thought was related to this.
I have a workflow which usually runs through fine. It creates hundreds of pods, processing data, each one usually completing without issue.
However, every once in a while, the main container does not move out of its "Running" Phase (see the screenshot).
This pod should take a couple of minutes, but it has been sitting in the Running phase for 13 hours.
Initially, I thought there was a bug in the code, keeping the container alive, so I added a bunch of logging - it does seem that the code executes successfully. In my case, the last log item just says that an upload completed successfully.
Further, these "hanging pod" problems occur sporadically - it's not necessarily the same piece of data which it fails on, which leads me to believe that it's not a data issue.
The wait container is of course still running (which is why I thought maybe it was related to the issue I linked above), but I guess the wait container is just still running because the main container is still running?
I'm not sure how to debug this - any advice would be greatly appreciated :)
(argo-workflows v3.0.8)
Further:
When I try to exec into the main container of the hanging pod with:
kubectl exec -it --container main -- /bin/bash
I get:
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused "exit status 1"": unknown
command terminated with exit code 126
However, I am able to exec into the wait container. Also, I am able to exec into the main container of a healthy pod (before it completes, of course).