fix: app and envoy container not shutting down gracefully #2155

cre8ivejp · 2025-10-09T02:28:49Z

The HTTP server is shutting down gracefully, but the gRPC is not, causing requests to fail instantly when the pods get the SIGTERM signal.

This pull request introduces several improvements to the health check endpoints, readiness/liveness/startup probes, and connection handling for rolling updates in the Bucketeer API deployment. The main goals are to support more robust health signaling, ensure smoother rolling updates, and improve reliability for clients during deployments. Below are the most important changes grouped by theme.

Health Check and Readiness Improvements

Added /ready endpoint to both API and web backends in nginx/bucketeer.conf for readiness checks, alongside the existing /health endpoint. [1] [2]
Updated Kubernetes deployment probes to use /ready for readiness, liveness, and startup checks, replacing /health where appropriate. [1] [2] [3] [4]
Improved health check configuration in backend-config.yaml, adding interval and threshold settings for more frequent and responsive health checks.

Rolling Update and Connection Handling

Reduced Envoy idle, stream, and request timeouts from 620s/30s to 60s to force clients to reconnect during rolling updates, preventing stale connections and 499 errors.
Enhanced preStop hook logic in the deployment to gracefully drain Envoy listeners and allow time for load balancer propagation, improving pod termination behavior.

Timeouts and Retry Policies

Increased route timeouts for all API and web service endpoints from 35s to 60s in Envoy config, and raised retry attempts from 1 to 3 for better resilience under transient failures. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]

Deployment Configuration Enhancements

Added configurable terminationGracePeriodSeconds and strategy options to the deployment spec for more flexible rollout and shutdown behavior. [1] [2]

Let me know if you have questions about how these changes affect deployment, readiness checks, or rolling update reliability!
This pull request introduces several improvements to health check handling, graceful shutdown, and timeout configuration for Bucketeer's API and batch services. The main focus is on adding /ready readiness endpoints, enhancing shutdown scripts to wait for active requests, and increasing timeouts for web service routes. These changes aim to improve reliability during deployments and ensure smoother traffic handling.

Health check and readiness improvements:

Added /ready readiness endpoints alongside /health in both NGINX configuration (bucketeer.conf) and Envoy config, allowing Kubernetes and other orchestrators to distinguish between liveness and readiness checks. [1] [2] Fedcfaa8L224R224)
Updated Kubernetes deployment manifests for both API and batch services to use /ready for readiness probes and /health for liveness probes, aligning with best practices. (Fe59bce0L161R161, F9765c1cL197R197)

Graceful shutdown enhancements:

Improved preStop lifecycle shutdown scripts in both API and batch deployments to wait for active requests to complete (using Envoy stats) before terminating, instead of just waiting for all TCP connections to close. This helps prevent dropped requests during rolling updates. (Fe59bce0L175R175, F9765c1cL213R213)

Timeout and probe configuration:

Increased timeouts for all web service routes in Envoy from 35s to 45s to reduce the likelihood of premature request termination. (Fedcfaa8L316R316, Fedcfaa8L329R329, Fedcfaa8L342R342, Fedcfaa8L355R355, Fedcfaa8L368R368, Fedcfaa8L381R381, Fedcfaa8L394R394, Fedcfaa8L407R407, Fedcfaa8L420R420, Fedcfaa8L446R446, Fedcfaa8L459R459)
Adjusted liveness and readiness probe parameters (e.g., periods, thresholds, timeouts) in the Helm values file for more responsive and robust health checking.

Other configuration updates:

Made terminationGracePeriodSeconds configurable (default 30s) in both API and batch deployments for more graceful pod shutdowns. [1] [2]
Added a deprecation notice for the grpc-web filter in Envoy, indicating plans for future removal.

…fic spikes Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

Copilot

Pull Request Overview

This PR fixes graceful shutdown of gRPC servers to prevent request failures when pods receive SIGTERM signals. The current issue was that while HTTP servers shut down gracefully, gRPC servers did not, causing instant failures during pod termination.

Key changes:

Implemented proper gRPC graceful shutdown with parallel server coordination
Extended shutdown timeout from 10s to 20s to accommodate GCP Spot VM constraints
Added Envoy coordination mechanism through /internal/shutdown-ready endpoint

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
pkg/rpc/server.go	Core gRPC graceful shutdown implementation with HTTP/gRPC coordination and Envoy endpoint
pkg/web/cmd/server/server.go	Parallel gRPC server shutdown with wait groups and structured shutdown sequence
pkg/api/cmd/server.go	Graceful shutdown coordination for API service with proper server ordering
pkg/batch/cmd/server/server.go	Batch service shutdown improvements with parallel server handling
pkg/subscriber/cmd/server/server.go	Subscriber service shutdown optimization for PubSub message processing
pkg/metrics/metrics.go	Updated Prometheus collectors and error handling for metrics server shutdown
manifests/bucketeer/charts/*/templates/deployment.yaml	Kubernetes deployment updates with termination grace periods and Envoy preStop hooks
manifests/bucketeer/charts/api/values.yaml	Health check probe timing adjustments for faster readiness detection
manifests/bucketeer/charts/api/templates/envoy-configmap.yaml	Extended Envoy timeout values to accommodate graceful shutdown timing

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

pkg/metrics/metrics.go

pkg/rpc/server.go

pkg/web/cmd/server/server.go

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

Copilot

Pull Request Overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

pkg/api/cmd/server.go:378

This line appears to be incorrect - it should close auditLogClient but the variable name suggests it should be closing autoOpsClient based on the context and the variable being assigned on line 387.

	defer auditLogClient.Close()

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp · 2025-10-20T07:19:12Z

manifests/bucketeer/charts/api/templates/deployment.yaml

+                    # Wait for load balancer propagation (must match app container propagation delay)
+                    sleep 15
+                    wget -q -T 1 -O- --method=POST --body-data='' \
+                      "http://localhost:${admin_port}/drain_listeners?graceful" || true
+                    exit 0


This will ensure that the envoy container waits until the LB stops sending new requests before shutting down gracefully the in-flight requests.

cre8ivejp · 2025-10-20T09:20:25Z

pkg/api/cmd/server.go

+	serverCtx, serverCtxCancel := context.WithCancel(context.Background())
+	defer serverCtxCancel()
+	if err := apiGateway.Start(serverCtx, gatewayHandler); err != nil {


Bascailly, this and the envoy preStop script was the issue.
The envoy preStop was not waiting for the api gateway to shut down gracefully because the netstat command was removed from envoy image.

Also, the was an issue in the api container using the app context, so at the moment the app gets the sigterm, it starts sending GOAWAY frames to client, which cases 499 errors. we must use a different ctx so we can manage the shut down gracefully.

nnnkkk7

Thanks!
I left a few comments!

watch_20251015-214530

pkg/api/cmd/server.go

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

nnnkkk7

Nice work!
LGTM!!

Ubisoft-potato

Great work for solving graceful shut down!
Thanks! 👍

cre8ivejp added 14 commits October 9, 2025 10:31

fix: improve timeout handling and health check resilience during traf…

b8337b9

…fic spikes Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: set mising timeout settings for api server

e616e3d

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: adjust timeout settings

286b464

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

feat: add metrics to monitor grpc server shutdown

e2f1699

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: implement prometheus push gateway

5618c14

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: metrics service name

59fcb8e

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: set prometheusPushGatewayURL for all services

0a4e7d3

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: deprecated collector functions

d4e3fe1

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: grouping label conflict

72941f4

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: split server and service labels

d8bc12e

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: missing server label

2649024

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: remove shutdown metrics

c8c4fa3

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: rebase

6047110

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: gracefull shutdown for all services

cb90076

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp requested a review from Copilot October 9, 2025 02:30

Copilot AI reviewed Oct 9, 2025

View reviewed changes

pkg/metrics/metrics.go Show resolved Hide resolved

pkg/rpc/server.go Outdated Show resolved Hide resolved

pkg/web/cmd/server/server.go Outdated Show resolved Hide resolved

cre8ivejp added 2 commits October 9, 2025 11:38

chore: set the timeout for k8s readiness and liveness

95e1943

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: lint error

abe9498

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp marked this pull request as ready for review October 9, 2025 03:16

cre8ivejp requested review from hvn2k1, nnnkkk7 and t-kikuc as code owners October 9, 2025 03:16

fix: shutting down process

b66f813

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch 2 times, most recently from fdb962a to f393032 Compare October 9, 2025 07:05

fix: shutdown order

cfdfa18

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch from f393032 to cfdfa18 Compare October 9, 2025 07:57

cre8ivejp added 3 commits October 9, 2025 17:40

chore: remove drain_listeners from envoy prestop

cc11c98

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

fix: 503 errors when shutting down the server

1f51d03

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: remove internal shutdown ready handler

e910a7b

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch from b9349c8 to 6994f4b Compare October 16, 2025 14:26

fix: health check response when getting sigterm

4b641a6

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch from 6994f4b to 4b641a6 Compare October 16, 2025 14:41

chore: change retry count

2e31c9a

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch from eb84e32 to 2e31c9a Compare October 16, 2025 16:36

cre8ivejp added 2 commits October 17, 2025 19:05

chore: remove envoy healthcheck fail call

eebcfd1

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

chore: sleep before stopping health check

1d75ac9

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch 5 times, most recently from a453319 to 41961ae Compare October 18, 2025 11:47

chore: change api agteway server context

f6d8894

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp force-pushed the fix-gracefull-shutdown branch from 41961ae to f6d8894 Compare October 18, 2025 12:14

cre8ivejp requested a review from Copilot October 19, 2025 05:57

Copilot AI reviewed Oct 19, 2025

View reviewed changes

chore: change envoy pre stop script

88765b9

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

cre8ivejp commented Oct 20, 2025

View reviewed changes

cre8ivejp marked this pull request as ready for review October 20, 2025 07:40

cre8ivejp requested review from Ubisoft-potato, hvn2k1 and t-kikuc October 20, 2025 07:40

cre8ivejp commented Oct 20, 2025

View reviewed changes

nnnkkk7 reviewed Oct 20, 2025

View reviewed changes

watch_20251015-214530 Outdated Show resolved Hide resolved

pkg/api/cmd/server.go Show resolved Hide resolved

chore: remove duplicate defer functions

25b3cbb

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>

nnnkkk7 approved these changes Oct 20, 2025

View reviewed changes

Ubisoft-potato approved these changes Oct 20, 2025

View reviewed changes

cre8ivejp merged commit c78bb7e into main Oct 20, 2025
17 checks passed

cre8ivejp deleted the fix-gracefull-shutdown branch October 20, 2025 14:22

bucketeer-bot mentioned this pull request Oct 9, 2025

chore(main): release 2.2.0 #2142

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: app and envoy container not shutting down gracefully #2155

fix: app and envoy container not shutting down gracefully #2155

Uh oh!

cre8ivejp commented Oct 9, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

cre8ivejp Oct 20, 2025

Uh oh!

cre8ivejp Oct 20, 2025

Uh oh!

nnnkkk7 left a comment

Uh oh!

Uh oh!

Uh oh!

nnnkkk7 left a comment

Uh oh!

Ubisoft-potato left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fix: app and envoy container not shutting down gracefully #2155

fix: app and envoy container not shutting down gracefully #2155

Uh oh!

Conversation

cre8ivejp commented Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

cre8ivejp Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

cre8ivejp Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

nnnkkk7 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

nnnkkk7 left a comment

Choose a reason for hiding this comment

Uh oh!

Ubisoft-potato left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cre8ivejp commented Oct 9, 2025 •

edited

Loading