Implemented health probe for cleaner shutdown #31866

Open: wants to merge 4 commits into base: master

Conversation

@eugene-uniswap commented May 21, 2025

Problem

When op-geth is being restarted, we need to signal the load balancer to stop sending requests to the node before shutting it down; otherwise we will see errors in the RPC. The proper way to signal the load balancer is to implement a server-side readiness probe endpoint that starts failing once SIGTERM is received, allowing the load balancer to stop sending traffic to the node; then, after a reasonable timeout, the node drains in-flight requests and exits.

This PR implements functionality to:

  • fail the /readyz health check when node shutdown is initiated.
  • allow 7 seconds for existing requests to drain before the RPC servers are shut down and the database is closed (see the sketch after this list).
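
A minimal, self-contained Go sketch of the drain-on-signal pattern described above (this is not the PR's actual code; the port, handler wiring, and hard-coded 7-second window are illustrative assumptions):

package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

func main() {
    var shuttingDown atomic.Bool // requires Go 1.19+

    mux := http.NewServeMux()
    // /readyz returns 200 until shutdown is initiated, then 503 so the load
    // balancer stops routing new requests to this node.
    mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if shuttingDown.Load() {
            http.Error(w, "shutting down", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    srv := &http.Server{Addr: "127.0.0.1:8551", Handler: mux}
    go srv.ListenAndServe()

    // On SIGINT/SIGTERM: start failing the probe, keep serving in-flight
    // traffic for the drain window, then stop the server.
    sigc := make(chan os.Signal, 1)
    signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM)
    <-sigc

    shuttingDown.Store(true)
    time.Sleep(7 * time.Second) // drain window; the PR later makes this opt-in

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    srv.Shutdown(ctx)
}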

Tests

Testing was done on a local workstation by launching op-geth, hitting /readyz on the RPC port with fortio, interrupting op-geth with Ctrl-C, and watching the /readyz probe fail while op-geth waits 7 seconds before shutting down the server.

fortio load -c 1 -qps 1 -allow-initial-errors -n 10000 http://127.0.0.1:8551/readyz

17:50:26.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:27.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:28.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:29.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:30.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:31.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:32.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:33.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:34.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:35.383 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0
17:50:36.385 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0

op-geth output:

^CINFO [04-22|17:50:28.036] Got interrupt, shutting down...
INFO [04-22|17:50:35.037] HTTP server stopped                      endpoint=127.0.0.1:8551
INFO [04-22|17:50:35.038] IPC endpoint closed                      url=/Users/eugene.aleynikov/Library/Ethereum/geth.ipc
INFO [04-22|17:50:35.155] Ethereum protocol stopped
INFO [04-22|17:50:35.155] Transaction pool stopped
INFO [04-22|17:50:35.204] Persisting dirty state to disk           root=d7f897..0f0544 layers=0
INFO [04-22|17:50:35.205] Persisted dirty state to disk            size=74.00B elapsed="264.834µs"
INFO [04-22|17:50:35.236] Blockchain stopped

Additional context

  • This PR is needed for the work to add a backend health check to proxyd.
  • When launching op-geth via Docker Compose, the default stop_grace_period is 10 seconds. This should be increased to at least 15 seconds (ideally 30 seconds) to align with the defaults used by Kubernetes and ECS. Otherwise, the container risks being forcefully terminated with SIGKILL before the drain logic has a chance to complete, which would cause an unclean shutdown and possible on-disk data corruption.

UPDATE
More testing after making the shutdown delay optional (a rough sketch of a flag-gated delay follows the logs below).

  1. Specify a shutdown delay of 10 seconds and observe the shutdown process (note the timestamps):
    build/bin/geth --shutdown-delay 10s
^CINFO [06-12|12:57:39.058] Got interrupt, shutting down...
INFO [06-12|12:57:49.059] HTTP server stopped                      endpoint=127.0.0.1:8551
  2. Without the --shutdown-delay option:
^CINFO [06-12|13:01:36.265] Got interrupt, shutting down...
INFO [06-12|13:01:36.265] HTTP server stopped                      endpoint=127.0.0.1:8551
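
For reference, a rough sketch of how an opt-in delay like this can be gated on a flag. The standard library flag package is used purely for illustration; op-geth wires its flags through urfave/cli, so the exact plumbing in the PR differs:

package main

import (
    "flag"
    "log"
    "time"
)

func main() {
    // Zero by default, so behavior is unchanged unless --shutdown-delay is passed.
    shutdownDelay := flag.Duration("shutdown-delay", 0, "how long to keep serving after a shutdown signal before stopping the RPC servers")
    flag.Parse()

    log.Println("Got interrupt, shutting down...")
    if *shutdownDelay > 0 {
        // While sleeping here, the /readyz probe is already failing, so the
        // load balancer can drain traffic before the server actually stops.
        time.Sleep(*shutdownDelay)
    }
    log.Println("HTTP server stopped")
}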

@jwasinger (Contributor) left a comment

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

@eugene-uniswap (Author)

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

Here's the logic behind it:

We have implemented a health check in proxyd (it will be committed upstream later), but it does not make sense to poll it at intervals shorter than 5 seconds. Running a longer interval also does not make sense, because it increases reaction time and slows down deployment. So 7 seconds is close to an ideal value that should work for everyone, and everyone who runs any kind of load balancer needs a health check.

Also, if people use k8s, there is no guarantee that SIGTERM will arrive after kube-proxy removes the node from the service; there is a race condition there. If you decide to run Istio, it can add a further 2-3 seconds of configuration replication lag on top of that. So k8s users will also benefit from this delay.

@fjl (Contributor)

fjl commented May 27, 2025

We don't want to add the timeout. If you need to coordinate shutdown with your loadbalancer, then your cluster management needs to first take a Geth instance out of the loadbalancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

@fjl closed this May 27, 2025
@fjl reopened this May 27, 2025
@eugene-uniswap (Author)

We don't want to add the timeout. If you need to coordinate shutdown with your loadbalancer, then your cluster management needs to first take a Geth instance out of the loadbalancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

The primary purpose of adding the health check was to signal proxyd. Since proxyd doesn't offer an API to remove a node from rotation, it currently relies on tracking error rates, meaning a significant number of errors can occur before it reacts. We've implemented a health check to address this, but it's a polling mechanism, with a reaction delay during which traffic must still be served. This feature is essential, and once it's fully tested, we plan to merge the health check code upstream into proxyd.

So is this PR a no-go, or are there conditions under which it could happen? If we make this delay optional via a command-line switch, will that work?

@fjl (Contributor)

fjl commented May 29, 2025

We don't want to add it the way it is now in the PR. It could work with a flag, like --shutdown-delay but I still think it's weird. The only way to shut down Geth is sending SIGINT to it. It already takes a significant amount of time after the signal to shut down, because Geth has to write the current state to disk and finish any long-running processing cleanly. So if we add a delay it has to be done in a way that doesn't lengthen the time further. Most process supervisors have a setting that kills the process if it takes too long to shut down and this is already a problem. I would rather make Geth exit instantly than increase the delay.

@eugene-uniswap (Author)

eugene-uniswap commented May 29, 2025

We don't want to add it the way it is now in the PR. It could work with a flag, like --shutdown-delay but I still think it's weird. The only way to shut down Geth is sending SIGINT to it. It already takes a significant amount of time after the signal to shut down, because Geth has to write the current state to disk and finish any long-running processing cleanly. So if we add a delay it has to be done in a way that doesn't lengthen the time further. Most process supervisors have a setting that kills the process if it takes too long to shut down and this is already a problem. I would rather make Geth exit instantly than increase the delay.

k8s has terminationGracePeriodSeconds, which is 30 seconds by default and can be increased without a problem. ECS also has a 30-second grace period. 7 seconds is not that much on a 30-second scale. We only need to care about Docker Compose, which has a 10-second default grace timeout.
@fjl If I implement --shutdown-delay and make the delay opt-in, is there a chance this PR gets merged?

@fjl (Contributor)

fjl commented May 29, 2025

It is not so easy. The flag is a good step, and we will definitely not merge it without the flag. But I'm trying to explore alternative solutions also. For example, we might want to add an admin_ method for shutting down the node gracefully. I left the PR open to discuss possibilities here. We will triage this again next week as a team.

@eugene-uniswap (Author)

It is not so easy. The flag is a good step, and we will definitely not merge it without the flag. But I'm trying to explore alternative solutions also. For example, we might want to add an admin_ method for shutting down the node gracefully. I left the PR open to discuss possibilities here. We will triage this again next week as a team.

I'm concerned that this might be difficult to align with the container lifecycle. For example, when you update a container version, you typically revise the ECS task definition. ECS then begins the update by sending a SIGTERM signal to one of the running containers (we don't know which one will be chosen first). At that point, you have two options:

  1. Stop accepting requests immediately, which can cause a spike in errors at the downstream load balancer (the current scenario).

  2. Signal the load balancer to stop sending new traffic first, while continuing to serve existing requests until the load balancer reacts to a health check.

If the load balancer is integrated into the platform (like ALB or kube-proxy), the first option is generally manageable. However, when using something like proxyd, which is not platform-native, there is no built-in mechanism to signal it aside from the health check we're trying to implement.

There’s a reason SIGTERM and SIGKILL are spaced out by 30 seconds (by default): that window is intended to give time to resolve race conditions and ensure a clean shutdown, where the client stops sending requests first and the server stops accepting them only after that.


@fjl (Contributor)

fjl commented Jun 3, 2025

Yes, we will definitely consider merging this when it is behind a flag! But the flag should not be on by default.

@fjl removed the status:triage label Jun 3, 2025
@fjl assigned jwasinger and unassigned jwasinger Jun 3, 2025
@eugene-uniswap (Author)

eugene-uniswap commented Jun 9, 2025

Yes, we will definitely consider merging this when it is behind a flag! But the flag should not be on by default.

@fjl Added a commit making the delay optional and off by default. Did some testing (see the PR description).

@fjl self-assigned this Jun 17, 2025
@eugene-uniswap (Author)

@fjl Hi Felix! What is the final conclusion? Thanks!
