-
Well, we got it tuned so that those connections stop happening, but we still see very high heap usage in some pods, which can eventually result in 'envoy overloaded' errors.
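For reference, the 'envoy overloaded' responses come from Envoy's overload manager once a resource monitor trips its threshold. A minimal sketch of that kind of configuration, with an assumed 2 GiB heap ceiling and illustrative thresholds (both should be tuned to the pod's actual memory limit):

```yaml
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        # Assumed ceiling; should track the container's memory limit.
        max_heap_size_bytes: 2147483648
  actions:
    # Try to release unused heap back to the OS first...
    - name: "envoy.overload_actions.shrink_heap"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95
    # ...and only reject requests ('envoy overloaded') near the ceiling.
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.98
```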
-
We recently went live with Envoy in production, and while we don't notice any impact from the outside, we do see some odd behavior in our metrics.
A couple of facts about our setup:
- CPU: 2000m (currently)

What we are experiencing is that even with the above values, some pods "fail" and get recycled by Kubernetes. Before we made the liveness probe adjustments above, we saw sudden CPU spikes (150% avg) that caused the majority of the pods to fail.
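For illustration, a sketch of the kind of probe and resource settings meant above, assuming the liveness check hits Envoy's admin `/ready` endpoint (the port and thresholds are placeholders, not our exact values):

```yaml
resources:
  limits:
    cpu: "2000m"
livenessProbe:
  httpGet:
    path: /ready        # Envoy admin readiness endpoint
    port: 9901          # assumed admin port
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3   # tolerate short CPU spikes before a restart
```

With a tight failureThreshold, a sustained CPU spike can fail the probe and make Kubernetes restart an otherwise healthy pod, which would match the recycling we see.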
My question is: is there something generally wrong with our configuration? The pods look like they should be stable, yet we see a "crash" of 5-10 pods every 5-10 minutes.
Since we do seem to have a large number of connections (upwards of 70,000), do we have to adjust: ??
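One knob that looks relevant (an assumption on our side, not confirmed) is the global downstream connection limit, which can be raised via a static runtime layer in the bootstrap; newer Envoy versions express the same cap through the overload manager's downstream-connections monitor:

```yaml
layered_runtime:
  layers:
    - name: static_layer_0
      static_layer:
        overload:
          # Illustrative value; with ~70,000 connections the limit
          # needs headroom above the observed peak.
          global_downstream_max_connections: 100000
```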
A screenshot of our logs is below, in case it helps.