- 
                Notifications
    You must be signed in to change notification settings 
- Fork 23
fix: app and envoy container not shutting down gracefully #2155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…fic spikes Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes graceful shutdown of gRPC servers to prevent request failures when pods receive SIGTERM signals. The current issue was that while HTTP servers shut down gracefully, gRPC servers did not, causing instant failures during pod termination.
Key changes:
- Implemented proper gRPC graceful shutdown with parallel server coordination
- Extended shutdown timeout from 10s to 20s to accommodate GCP Spot VM constraints
- Added Envoy coordination mechanism through /internal/shutdown-readyendpoint
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description | 
|---|---|
| pkg/rpc/server.go | Core gRPC graceful shutdown implementation with HTTP/gRPC coordination and Envoy endpoint | 
| pkg/web/cmd/server/server.go | Parallel gRPC server shutdown with wait groups and structured shutdown sequence | 
| pkg/api/cmd/server.go | Graceful shutdown coordination for API service with proper server ordering | 
| pkg/batch/cmd/server/server.go | Batch service shutdown improvements with parallel server handling | 
| pkg/subscriber/cmd/server/server.go | Subscriber service shutdown optimization for PubSub message processing | 
| pkg/metrics/metrics.go | Updated Prometheus collectors and error handling for metrics server shutdown | 
| manifests/bucketeer/charts/*/templates/deployment.yaml | Kubernetes deployment updates with termination grace periods and Envoy preStop hooks | 
| manifests/bucketeer/charts/api/values.yaml | Health check probe timing adjustments for faster readiness detection | 
| manifests/bucketeer/charts/api/templates/envoy-configmap.yaml | Extended Envoy timeout values to accommodate graceful shutdown timing | 
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
fdb962a    to
    f393032      
    Compare
  
    Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
f393032    to
    cfdfa18      
    Compare
  
    Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
b9349c8    to
    6994f4b      
    Compare
  
    Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
6994f4b    to
    4b641a6      
    Compare
  
    Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
eb84e32    to
    2e31c9a      
    Compare
  
    Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
a453319    to
    41961ae      
    Compare
  
    Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
41961ae    to
    f6d8894      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
pkg/api/cmd/server.go:378
- This line appears to be incorrect - it should close auditLogClientbut the variable name suggests it should be closingautoOpsClientbased on the context and the variable being assigned on line 387.
	defer auditLogClient.Close()
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
| # Wait for load balancer propagation (must match app container propagation delay) | ||
| sleep 15 | ||
| wget -q -T 1 -O- --method=POST --body-data='' \ | ||
| "http://localhost:${admin_port}/drain_listeners?graceful" || true | ||
| exit 0 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will ensure that the envoy container waits until the LB stops sending new requests before shutting down gracefully the in-flight requests.
| serverCtx, serverCtxCancel := context.WithCancel(context.Background()) | ||
| defer serverCtxCancel() | ||
| if err := apiGateway.Start(serverCtx, gatewayHandler); err != nil { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bascailly, this and the envoy preStop script was the issue.
The envoy preStop was not waiting for the api gateway to shut down gracefully because the netstat command was removed from envoy image.
Also, the was an issue in the api container using the app context, so at the moment the app gets the sigterm, it starts sending GOAWAY frames to client, which cases 499 errors. we must use a different ctx so we can manage the shut down gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I left a few comments!
Signed-off-by: Alessandro Yuichi Okimoto <yuichijpn@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
LGTM!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work for solving graceful shut down!
Thanks! 👍
Fix #2154
The HTTP server is shutting down gracefully, but the gRPC is not, causing requests to fail instantly when the pods get the SIGTERM signal.
This pull request introduces several improvements to the health check endpoints, readiness/liveness/startup probes, and connection handling for rolling updates in the Bucketeer API deployment. The main goals are to support more robust health signaling, ensure smoother rolling updates, and improve reliability for clients during deployments. Below are the most important changes grouped by theme.
Health Check and Readiness Improvements
/readyendpoint to both API and web backends innginx/bucketeer.conffor readiness checks, alongside the existing/healthendpoint. [1] [2]/readyfor readiness, liveness, and startup checks, replacing/healthwhere appropriate. [1] [2] [3] [4]backend-config.yaml, adding interval and threshold settings for more frequent and responsive health checks.Rolling Update and Connection Handling
Timeouts and Retry Policies
Deployment Configuration Enhancements
terminationGracePeriodSecondsand strategy options to the deployment spec for more flexible rollout and shutdown behavior. [1] [2]Let me know if you have questions about how these changes affect deployment, readiness checks, or rolling update reliability!
This pull request introduces several improvements to health check handling, graceful shutdown, and timeout configuration for Bucketeer's API and batch services. The main focus is on adding
/readyreadiness endpoints, enhancing shutdown scripts to wait for active requests, and increasing timeouts for web service routes. These changes aim to improve reliability during deployments and ensure smoother traffic handling.Health check and readiness improvements:
/readyreadiness endpoints alongside/healthin both NGINX configuration (bucketeer.conf) and Envoy config, allowing Kubernetes and other orchestrators to distinguish between liveness and readiness checks. [1] [2] Fedcfaa8L224R224)/readyfor readiness probes and/healthfor liveness probes, aligning with best practices. (Fe59bce0L161R161, F9765c1cL197R197)Graceful shutdown enhancements:
Timeout and probe configuration:
Other configuration updates:
terminationGracePeriodSecondsconfigurable (default 30s) in both API and batch deployments for more graceful pod shutdowns. [1] [2]