Description
We encountered a situation where ztunnel was unable to allocate new file descriptors (FDs):
```
2025-08-19T08:34:42.174722022Z stdout F {"level":"error","time":"2025-08-19T08:34:42.174543Z","scope":"ztunnel::proxy::inbound_passthrough","message":"Failed TCP handshake Too many open files (os error 24)","proxy":{"wl":"ingress-nginx-public-tck-cf/apigateway-lua-tck-cf-ingress-nginx-controller-68d95d4c99-m2hbx"}}
2025-08-19T08:34:42.174727533Z stdout F {"level":"error","time":"2025-08-19T08:34:42.174546Z","scope":"ztunnel::proxy::inbound_passthrough","message":"Failed TCP handshake Too many open files (os error 24)","proxy":{"wl":"ingress-nginx-public-tck-cf/apigateway-lua-tck-cf-ingress-nginx-controller-68d95d4c99-m2hbx"}}
2025-08-19T08:34:42.174733223Z stdout F {"level":"error","time":"2025-08-19T08:34:42.174549Z","scope":"ztunnel::proxy::inbound_passthrough","message":"Failed TCP handshake Too many open files (os error 24)","proxy":{"wl":"ingress-nginx-public-tck-cf/apigateway-lua-tck-cf-ingress-nginx-controller-68d95d4c99-m2hbx"}}
```
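For context, os error 24 is `EMFILE`: the process has hit its `RLIMIT_NOFILE` cap, so every call that needs a new descriptor (including `accept()`) fails until descriptors are released. Here is a minimal, self-contained reproduction of the symptom (not ztunnel code; assumes Linux and the `libc` crate):

```rust
use std::net::{TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // Lower the per-process FD limit so exhaustion is reached quickly.
    unsafe {
        let lim = libc::rlimit { rlim_cur: 32, rlim_max: 32 };
        assert_eq!(libc::setrlimit(libc::RLIMIT_NOFILE, &lim), 0);
    }

    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let mut leaked = Vec::new(); // never dropped: simulates an FD leak
    loop {
        // Each iteration consumes two descriptors that are never released.
        match TcpStream::connect(addr) {
            Ok(c) => leaked.push(c),
            Err(e) => {
                // Prints "Too many open files (os error 24)" once the cap is hit.
                eprintln!("connect failed: {e}");
                return Ok(());
            }
        }
        match listener.accept() {
            Ok((s, _)) => leaked.push(s),
            Err(e) => {
                eprintln!("accept failed: {e}");
                return Ok(());
            }
        }
    }
}
```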
Because of that, ztunnel was failing its readiness probe (hence the pod showing `Running 0/1`).
Yet the pod didn't crash or restart, because there is no liveness probe; that appears to be a deliberate decision, at least in the sidecar version of Istio.
I think that in such a scenario ztunnel should crash immediately: it is not doing useful work anyway, and it is impacting connectivity to and from the node for every workload in the mesh. Moreover, if there is an FD leak in ztunnel itself (which was the case for us), restarting the pod would free the occupied descriptors and thus fix the problem, at least temporarily.
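A minimal sketch of what this could look like in an accept loop (illustrative only, not ztunnel's actual `inbound_passthrough` code): treat `EMFILE`/`ENFILE` from `accept()` as fatal and abort, so the kubelet restarts the container and all leaked descriptors are released:

```rust
use std::io;
use std::net::TcpListener;

// Illustrative accept loop; ztunnel's real code is structured differently.
fn serve(listener: TcpListener) -> io::Result<()> {
    loop {
        match listener.accept() {
            Ok((stream, peer)) => {
                // Hand the connection off to the proxy logic (omitted here).
                let _ = (stream, peer);
            }
            // EMFILE (24) / ENFILE (23) on Linux: the FD table is exhausted.
            // Logging and retrying cannot help; every subsequent accept will
            // fail too. Abort so the kubelet restarts the container, which
            // closes all descriptors held by the leaking process.
            Err(e) if matches!(e.raw_os_error(), Some(23) | Some(24)) => {
                eprintln!("fatal: {e}; aborting so the pod is restarted");
                std::process::abort();
            }
            // Transient errors (e.g. ECONNABORTED) are safe to retry.
            Err(e) => eprintln!("accept error, retrying: {e}"),
        }
    }
}
```

The key design point is distinguishing unrecoverable descriptor exhaustion from transient accept errors; aborting then relies on the container's restart policy to bring ztunnel back with a clean FD table.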