
Implemented health probe for cleaner shutdown #31866


Open
wants to merge 1 commit into master

Conversation

@eugene-uniswap commented May 21, 2025

Problem

When op-geth is restarted, we need to signal the load balancer to stop sending requests to a node before shutting it down; otherwise clients will see RPC errors. The proper way to signal the load balancer is to implement a server-side readiness probe endpoint that starts failing when SIGTERM is received, allowing the load balancer to stop sending traffic to the node; then, after a reasonable timeout, drain in-flight requests and exit.

This PR implements functionality to:

  • fail the /readyz health check when node shutdown is initiated;
  • allow 7 seconds for existing requests to drain before the RPC servers are shut down and the database is closed (see the sketch below).
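
For context, here is a minimal, self-contained Go sketch of the intended behavior. This is not the code in this PR: the /readyz path, the port, and the 7-second drain window come from the description above; everything else is illustrative.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()
	// Readiness probe: 200 while the node is serving, 503 once shutdown starts,
	// so the load balancer stops routing new requests to this node.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: "127.0.0.1:8551", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// On SIGINT/SIGTERM: start failing /readyz, wait for the load balancer to
	// notice, then drain in-flight requests and stop the server.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig

	shuttingDown.Store(true)
	time.Sleep(7 * time.Second) // drain window from the PR description

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx) // finishes in-flight requests before closing the listener
}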

Tests

Testing was done on a local workstation by launching op-geth, hitting /readyz on the RPC port with fortio, interrupting op-geth with Ctrl-C, and watching the /readyz probe fail while op-geth waits 7 seconds before shutting down the server.

fortio load -c 1 -qps 1 -allow-initial-errors -n 10000 http://127.0.0.1:8551/readyz

17:50:26.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:27.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:28.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:29.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:30.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:31.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:32.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:33.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:34.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:35.383 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0
17:50:36.385 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0

op-geth output:

^CINFO [04-22|17:50:28.036] Got interrupt, shutting down...
INFO [04-22|17:50:35.037] HTTP server stopped                      endpoint=127.0.0.1:8551
INFO [04-22|17:50:35.038] IPC endpoint closed                      url=/Users/eugene.aleynikov/Library/Ethereum/geth.ipc
INFO [04-22|17:50:35.155] Ethereum protocol stopped
INFO [04-22|17:50:35.155] Transaction pool stopped
INFO [04-22|17:50:35.204] Persisting dirty state to disk           root=d7f897..0f0544 layers=0
INFO [04-22|17:50:35.205] Persisted dirty state to disk            size=74.00B elapsed="264.834µs"
INFO [04-22|17:50:35.236] Blockchain stopped

Additional context

  • This PR is needed for the work to add a backend health check to proxyd.
  • When launching op-geth via Docker Compose, the default stop_grace_period is 10 seconds. This should be increased to at least 15 seconds, or ideally 30 seconds to align with the defaults used by Kubernetes and ECS; otherwise the container risks being forcefully terminated with SIGKILL before the drain logic completes, causing an unclean shutdown and possible on-disk data corruption (see the Compose fragment below).
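
For illustration only (the service name and image are placeholders, not part of this PR), a Compose fragment that raises the grace period could look like:

services:
  op-geth:
    image: op-geth:local      # placeholder image name
    stop_grace_period: 30s    # default is 10s; must exceed the 7s drain plus normal shutdown time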

@jwasinger (Contributor) left a comment

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

@eugene-uniswap (Author)

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

Here's the logic behind it:

We have implemented a health check in proxyd (it will be committed upstream later), but it does not make sense to fire it on intervals shorter than 5 seconds. Running a longer interval also does not make sense, because it increases reaction time and slows down deployment. So 7 seconds is close to an ideal value that should work for everyone, and everyone running any kind of load balancer needs a health check.

Also, if people use k8s, there is no guarantee that SIGTERM will arrive after kube-proxy removes the node from the service; there is a race condition there. If you run Istio on top of that, it can easily add 2-3 seconds of configuration replication lag. So k8s users will also benefit from this delay.
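
For illustration only (the field values below are assumptions, not part of this PR), a Kubernetes pod spec could consume the probe like this:

spec:
  terminationGracePeriodSeconds: 30   # must exceed the in-process drain window
  containers:
    - name: op-geth
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8551
        periodSeconds: 5       # probe interval bounds the reaction time
        failureThreshold: 1    # drop the pod from endpoints after one 503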

@fjl (Contributor) commented May 27, 2025

We don't want to add the timeout. If you need to coordinate shutdown with your loadbalancer, then your cluster management needs to first take a Geth instance out of the loadbalancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

@fjl fjl closed this May 27, 2025
@fjl fjl reopened this May 27, 2025
@eugene-uniswap (Author)

We don't want to add the timeout. If you need to coordinate shutdown with your loadbalancer, then your cluster management needs to first take a Geth instance out of the loadbalancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

The primary purpose of adding the health check was to signal proxyd. Since proxyd doesn't offer an API to remove a node from rotation, it currently relies on tracking error rates, meaning a significant number of errors can occur before it reacts. We've implemented a health check to address this, but it's a polling mechanism with a reaction delay during which traffic must still be served. This feature is essential, and once it's fully tested we plan to merge the health check code upstream into proxyd.

So is this PR a no-go, or are there conditions under which it could be accepted? If we make this delay optional via a command-line switch, will that work?
