Skip to content

Conversation

@cre8ivejp
Copy link
Member

@cre8ivejp cre8ivejp commented Oct 9, 2025

Fix #2154

The HTTP server is shutting down gracefully, but the gRPC is not, causing requests to fail instantly when the pods get the SIGTERM signal.


This pull request introduces several improvements to the health check endpoints, readiness/liveness/startup probes, and connection handling for rolling updates in the Bucketeer API deployment. The main goals are to support more robust health signaling, ensure smoother rolling updates, and improve reliability for clients during deployments. Below are the most important changes grouped by theme.

Health Check and Readiness Improvements

  • Added /ready endpoint to both API and web backends in nginx/bucketeer.conf for readiness checks, alongside the existing /health endpoint. [1] [2]
  • Updated Kubernetes deployment probes to use /ready for readiness, liveness, and startup checks, replacing /health where appropriate. [1] [2] [3] [4]
  • Improved health check configuration in backend-config.yaml, adding interval and threshold settings for more frequent and responsive health checks.

Rolling Update and Connection Handling

  • Reduced Envoy idle, stream, and request timeouts from 620s/30s to 60s to force clients to reconnect during rolling updates, preventing stale connections and 499 errors.
  • Enhanced preStop hook logic in the deployment to gracefully drain Envoy listeners and allow time for load balancer propagation, improving pod termination behavior.

Timeouts and Retry Policies

  • Increased route timeouts for all API and web service endpoints from 35s to 60s in Envoy config, and raised retry attempts from 1 to 3 for better resilience under transient failures. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]

Deployment Configuration Enhancements

  • Added configurable terminationGracePeriodSeconds and strategy options to the deployment spec for more flexible rollout and shutdown behavior. [1] [2]

Let me know if you have questions about how these changes affect deployment, readiness checks, or rolling update reliability!
This pull request introduces several improvements to health check handling, graceful shutdown, and timeout configuration for Bucketeer's API and batch services. The main focus is on adding /ready readiness endpoints, enhancing shutdown scripts to wait for active requests, and increasing timeouts for web service routes. These changes aim to improve reliability during deployments and ensure smoother traffic handling.

Health check and readiness improvements:

  • Added /ready readiness endpoints alongside /health in both NGINX configuration (bucketeer.conf) and Envoy config, allowing Kubernetes and other orchestrators to distinguish between liveness and readiness checks. [1] [2] Fedcfaa8L224R224)
  • Updated Kubernetes deployment manifests for both API and batch services to use /ready for readiness probes and /health for liveness probes, aligning with best practices. (Fe59bce0L161R161, F9765c1cL197R197)

Graceful shutdown enhancements:

  • Improved preStop lifecycle shutdown scripts in both API and batch deployments to wait for active requests to complete (using Envoy stats) before terminating, instead of just waiting for all TCP connections to close. This helps prevent dropped requests during rolling updates. (Fe59bce0L175R175, F9765c1cL213R213)

Timeout and probe configuration:

  • Increased timeouts for all web service routes in Envoy from 35s to 45s to reduce the likelihood of premature request termination. (Fedcfaa8L316R316, Fedcfaa8L329R329, Fedcfaa8L342R342, Fedcfaa8L355R355, Fedcfaa8L368R368, Fedcfaa8L381R381, Fedcfaa8L394R394, Fedcfaa8L407R407, Fedcfaa8L420R420, Fedcfaa8L446R446, Fedcfaa8L459R459)
  • Adjusted liveness and readiness probe parameters (e.g., periods, thresholds, timeouts) in the Helm values file for more responsive and robust health checking.

Other configuration updates:

  • Made terminationGracePeriodSeconds configurable (default 30s) in both API and batch deployments for more graceful pod shutdowns. [1] [2]
  • Added a deprecation notice for the grpc-web filter in Envoy, indicating plans for future removal.

…fic spikes

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp requested a review from Copilot October 9, 2025 02:30
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes graceful shutdown of gRPC servers to prevent request failures when pods receive SIGTERM signals. The current issue was that while HTTP servers shut down gracefully, gRPC servers did not, causing instant failures during pod termination.

Key changes:

  • Implemented proper gRPC graceful shutdown with parallel server coordination
  • Extended shutdown timeout from 10s to 20s to accommodate GCP Spot VM constraints
  • Added Envoy coordination mechanism through /internal/shutdown-ready endpoint

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/rpc/server.go Core gRPC graceful shutdown implementation with HTTP/gRPC coordination and Envoy endpoint
pkg/web/cmd/server/server.go Parallel gRPC server shutdown with wait groups and structured shutdown sequence
pkg/api/cmd/server.go Graceful shutdown coordination for API service with proper server ordering
pkg/batch/cmd/server/server.go Batch service shutdown improvements with parallel server handling
pkg/subscriber/cmd/server/server.go Subscriber service shutdown optimization for PubSub message processing
pkg/metrics/metrics.go Updated Prometheus collectors and error handling for metrics server shutdown
manifests/bucketeer/charts/*/templates/deployment.yaml Kubernetes deployment updates with termination grace periods and Envoy preStop hooks
manifests/bucketeer/charts/api/values.yaml Health check probe timing adjustments for faster readiness detection
manifests/bucketeer/charts/api/templates/envoy-configmap.yaml Extended Envoy timeout values to accommodate graceful shutdown timing

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp marked this pull request as ready for review October 9, 2025 03:16
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch 2 times, most recently from fdb962a to f393032 Compare October 9, 2025 07:05
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch from f393032 to cfdfa18 Compare October 9, 2025 07:57
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch from b9349c8 to 6994f4b Compare October 16, 2025 14:26
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch from 6994f4b to 4b641a6 Compare October 16, 2025 14:41
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch from eb84e32 to 2e31c9a Compare October 16, 2025 16:36
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch 5 times, most recently from a453319 to 41961ae Compare October 18, 2025 11:47
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
@cre8ivejp cre8ivejp force-pushed the fix-gracefull-shutdown branch from 41961ae to f6d8894 Compare October 18, 2025 12:14
@cre8ivejp cre8ivejp requested a review from Copilot October 19, 2025 05:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

pkg/api/cmd/server.go:378

  • This line appears to be incorrect - it should close auditLogClient but the variable name suggests it should be closing autoOpsClient based on the context and the variable being assigned on line 387.
	defer auditLogClient.Close()

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Comment on lines +188 to +192
# Wait for load balancer propagation (must match app container propagation delay)
sleep 15
wget -q -T 1 -O- --method=POST --body-data='' \
"http://localhost:${admin_port}/drain_listeners?graceful" || true
exit 0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will ensure that the envoy container waits until the LB stops sending new requests before shutting down gracefully the in-flight requests.

@cre8ivejp cre8ivejp marked this pull request as ready for review October 20, 2025 07:40
Comment on lines +547 to +549
serverCtx, serverCtxCancel := context.WithCancel(context.Background())
defer serverCtxCancel()
if err := apiGateway.Start(serverCtx, gatewayHandler); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bascailly, this and the envoy preStop script was the issue.
The envoy preStop was not waiting for the api gateway to shut down gracefully because the netstat command was removed from envoy image.

Also, the was an issue in the api container using the app context, so at the moment the app gets the sigterm, it starts sending GOAWAY frames to client, which cases 499 errors. we must use a different ctx so we can manage the shut down gracefully.

Copy link
Contributor

@nnnkkk7 nnnkkk7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I left a few comments!

Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Copy link
Contributor

@nnnkkk7 nnnkkk7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
LGTM!!

Copy link
Collaborator

@Ubisoft-potato Ubisoft-potato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work for solving graceful shut down!
Thanks! 👍

@cre8ivejp cre8ivejp merged commit c78bb7e into main Oct 20, 2025
17 checks passed
@cre8ivejp cre8ivejp deleted the fix-gracefull-shutdown branch October 20, 2025 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: app and envoy container not shutting down gracefully

4 participants