Description
Context:
We're running a GKE Standard cluster on 1.31.6-gke.1020000, using the native FUSE feature.
We've experienced an issue where some sort of activity causes the FUSE process to crash, specifically in the sidecar container (pasted from GCP Log Explorer, sorry):
gke-gcsfuse-sidecar: fuse: *fuseops.LookUpInodeOp error: input/output error
gke-gcsfuse-sidecar: fatal error: sync: unlock of unlocked mutex
gke-gcsfuse-sidecar: sync.fatal({0x17c3c28?, 0xc0008cb7e0?}) (long stacktrace of the crashing goroutine omitted for brevity)
gke-gcsfuse-sidecar: gcsfuse exited with error: exit status 2
However, the sidecar init container never exits or indicates that it has a problem of its own. Therefore, the natural GKE reaction of restarting our main container, which mounts the shared (and now failing) volume, never recovers the Pod on its own.
We see this error on our Pod/main container as:
Error: failed to generate container "3a947804065a11c2d9337520ea746f24566ad1ee7c372fc320586393ef7a4dd6" spec: failed to generate spec: failed to stat "/var/lib/kubelet/pods/4383ae58-bd17-4191-aca6-05a4f75b9872/volumes/kubernetes.io~csi/gcs-fuse-csi-ephemeral/mount": stat /var/lib/kubelet/pods/4383ae58-bd17-4191-aca6-05a4f75b9872/volumes/kubernetes.io~csi/gcs-fuse-csi-ephemeral/mount: transport endpoint is not connected
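For reference, "transport endpoint is not connected" (ENOTCONN) is what a stat() on a FUSE mount returns once the FUSE server process behind it has died. A minimal sketch, assuming a hypothetical mount path, of how a health/liveness check could detect this condition (this is not part of the driver, just an illustration):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// mountHealthy stats the mount point and reports whether the FUSE server
// behind it is still serving requests. When gcsfuse has died, the kernel
// returns ENOTCONN ("transport endpoint is not connected").
func mountHealthy(path string) (bool, error) {
	var st syscall.Stat_t
	err := syscall.Stat(path, &st)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, syscall.ENOTCONN) {
		return false, nil // FUSE daemon is gone; mount is broken
	}
	return false, err // some other error (permissions, missing path, ...)
}

func main() {
	// Example path only; in a Pod this would be the volume's mount path.
	path := "/data"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	ok, err := mountHealthy(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "check failed:", err)
		os.Exit(2)
	}
	if !ok {
		fmt.Fprintln(os.Stderr, "mount is broken (ENOTCONN)")
		os.Exit(1) // non-zero exit lets an exec/liveness probe fail the container
	}
	fmt.Println("mount is healthy")
}
```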
Our sidecar details from GKE:
Image: gke.gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761
Command Args: --v=5
Env Vars: NATIVE_SIDECAR=TRUE
My own analysis is that this is a bug in the sidecar container, which should either have a way to "self-recover" from FUSE process crashes, expose a liveness check based on the health of that process, or simply crash fatally itself when the FUSE process crashes.
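As an illustration of the "crash fatally itself" option, here is a rough sketch (not the driver's actual code, and the command/arguments are assumptions) of a wrapper that waits on the gcsfuse child process and exits with a non-zero status as soon as it dies, so the kubelet can restart the native sidecar container and remount:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// Hypothetical wrapper around gcsfuse: if the FUSE process dies for any
// reason, the wrapper exits non-zero instead of lingering, so the kubelet
// can restart the (native) sidecar container rather than leaving a dead mount.
func main() {
	cmd := exec.Command("gcsfuse", os.Args[1:]...) // args are illustrative
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start gcsfuse: %v", err)
	}

	// Wait blocks until gcsfuse terminates (crash, fatal error, OOM-kill, ...).
	if err := cmd.Wait(); err != nil {
		log.Printf("gcsfuse exited with error: %v", err)
		os.Exit(1) // propagate the failure so the container is marked as crashed
	}
	log.Print("gcsfuse exited cleanly")
}
```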
We were able to recover with a rollout restart of the Deployment, so I gather this was triggered by some transient GCS or GKE problem.