Implemented health probe for cleaner shutdown #31866

Open: wants to merge 4 commits into base: master

Conversation

@eugene-uniswap commented May 21, 2025

Problem

When op-geth is being restarted, we need to signal the load balancer to stop sending requests to the node before shutting it down; otherwise we will see errors in the RPC. The proper way to signal the load balancer is to implement a server-side readiness probe endpoint that starts failing once SIGTERM is received, allowing the load balancer to stop sending traffic to the node; then, after a reasonable timeout, the node drains in-flight requests and exits.

This PR implements functionality to:

  • fail the /readyz health check when node shutdown is initiated.
  • allow 7 seconds for existing requests to drain before the RPC servers are shut down and the database is closed (see the sketch after this list).
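
A minimal, self-contained Go sketch of the drain-on-signal pattern described above (this is not the PR's actual code; the port, handler wiring, and hard-coded 7-second window are illustrative assumptions):

package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

func main() {
    var shuttingDown atomic.Bool // requires Go 1.19+

    mux := http.NewServeMux()
    // /readyz returns 200 until shutdown is initiated, then 503 so the load
    // balancer stops routing new requests to this node.
    mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if shuttingDown.Load() {
            http.Error(w, "shutting down", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    srv := &http.Server{Addr: "127.0.0.1:8551", Handler: mux}
    go srv.ListenAndServe()

    // On SIGINT/SIGTERM: start failing the probe, keep serving in-flight
    // traffic for the drain window, then stop the server.
    sigc := make(chan os.Signal, 1)
    signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM)
    <-sigc

    shuttingDown.Store(true)
    time.Sleep(7 * time.Second) // drain window; the PR later makes this opt-in

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    srv.Shutdown(ctx)
}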

Tests

Testing was done on a local workstation by launching op-geth, hitting /readyz on the RPC port with fortio, interrupting op-geth with Ctrl-C, and watching the /readyz probe fail while op-geth waits 7 seconds before shutting down the server.

fortio load -c 1 -qps 1 -allow-initial-errors -n 10000 http://127.0.0.1:8551/readyz

17:50:26.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:27.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:28.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:29.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:30.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:31.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:32.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:33.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:34.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:35.383 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0
17:50:36.385 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0

op-geth output:

^CINFO [04-22|17:50:28.036] Got interrupt, shutting down...
INFO [04-22|17:50:35.037] HTTP server stopped                      endpoint=127.0.0.1:8551
INFO [04-22|17:50:35.038] IPC endpoint closed                      url=/Users/eugene.aleynikov/Library/Ethereum/geth.ipc
INFO [04-22|17:50:35.155] Ethereum protocol stopped
INFO [04-22|17:50:35.155] Transaction pool stopped
INFO [04-22|17:50:35.204] Persisting dirty state to disk           root=d7f897..0f0544 layers=0
INFO [04-22|17:50:35.205] Persisted dirty state to disk            size=74.00B elapsed="264.834µs"
INFO [04-22|17:50:35.236] Blockchain stopped

Additional context

  • This PR is needed for the work to add a backend health check to proxyd.
  • When launching op-geth via Docker Compose, the default stop_grace_period is 10 seconds. This should be increased to at least 15 seconds (ideally 30 seconds) to align with the defaults used by Kubernetes and ECS. Otherwise, the container risks being forcefully terminated with SIGKILL before the drain logic has a chance to complete, which would cause an unclean shutdown and possible on-disk data corruption.

UPDATE
More testing after making the shutdown delay optional (a rough sketch of a flag-gated delay follows the logs below).

  1. Specify a shutdown delay of 10 seconds and observe the shutdown process (note the timestamps):
    build/bin/geth --shutdown-delay 10s
^CINFO [06-12|12:57:39.058] Got interrupt, shutting down...
INFO [06-12|12:57:49.059] HTTP server stopped                      endpoint=127.0.0.1:8551
  2. Without the --shutdown-delay option:
^CINFO [06-12|13:01:36.265] Got interrupt, shutting down...
INFO [06-12|13:01:36.265] HTTP server stopped                      endpoint=127.0.0.1:8551
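
For reference, a rough sketch of how an opt-in delay like this can be gated on a flag. The standard library flag package is used purely for illustration; op-geth wires its flags through urfave/cli, so the exact plumbing in the PR differs:

package main

import (
    "flag"
    "log"
    "time"
)

func main() {
    // Zero by default, so behavior is unchanged unless --shutdown-delay is passed.
    shutdownDelay := flag.Duration("shutdown-delay", 0, "how long to keep serving after a shutdown signal before stopping the RPC servers")
    flag.Parse()

    log.Println("Got interrupt, shutting down...")
    if *shutdownDelay > 0 {
        // While sleeping here, the /readyz probe is already failing, so the
        // load balancer can drain traffic before the server actually stops.
        time.Sleep(*shutdownDelay)
    }
    log.Println("HTTP server stopped")
}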

@jwasinger (Contributor) left a comment

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

@eugene-uniswap (Author)

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

Here's the logic behind it:

We have implemented a health check in proxyd (it will be committed upstream later), but it does not make sense to poll it at intervals shorter than 5 seconds. Running a longer interval also does not make sense, because it increases reaction time and slows down deployment. So 7 seconds is close to an ideal value that should work for everyone, and everyone who runs any kind of load balancer needs a health check.

Also, if people use k8s, there is no guarantee that SIGTERM will arrive after kube-proxy removes the node from the service; there is a race condition there. If you decide to run Istio, it can add a further 2-3 seconds of configuration replication lag on top of that. So k8s users will also benefit from this delay.

@fjl (Contributor)

fjl commented May 27, 2025

We don't want to add the timeout. If you need to coordinate shutdown with your loadbalancer, then your cluster management needs to first take a Geth instance out of the loadbalancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

@fjl closed this May 27, 2025
@fjl reopened this May 27, 2025
@eugene-uniswap (Author)

We don't want to add the timeout. If you need to coordinate shutdown with your loadbalancer, then your cluster management needs to first take a Geth instance out of the loadbalancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

The primary purpose of adding the health check was to signal proxyd. Since proxyd doesn't offer an API to remove a node from rotation, it currently relies on tracking error rates, meaning a significant number of errors can occur before it reacts. We've implemented a health check to address this, but it's a polling mechanism, with a reaction delay during which traffic must still be served. This feature is essential, and once it's fully tested, we plan to merge the health check code upstream into proxyd.

So is this PR a no-go, or are there conditions under which it could happen? If we make this delay optional via a command-line switch, will that work?

@fjl (Contributor)

fjl commented May 29, 2025

We don't want to add it the way it is now in the PR. It could work with a flag, like --shutdown-delay but I still think it's weird. The only way to shut down Geth is sending SIGINT to it. It already takes a significant amount of time after the signal to shut down, because Geth has to write the current state to disk and finish any long-running processing cleanly. So if we add a delay it has to be done in a way that doesn't lengthen the time further. Most process supervisors have a setting that kills the process if it takes too long to shut down and this is already a problem. I would rather make Geth exit instantly than increase the delay.

@eugene-uniswap (Author)

eugene-uniswap commented May 29, 2025

We don't want to add it the way it is now in the PR. It could work with a flag, like --shutdown-delay but I still think it's weird. The only way to shut down Geth is sending SIGINT to it. It already takes a significant amount of time after the signal to shut down, because Geth has to write the current state to disk and finish any long-running processing cleanly. So if we add a delay it has to be done in a way that doesn't lengthen the time further. Most process supervisors have a setting that kills the process if it takes too long to shut down and this is already a problem. I would rather make Geth exit instantly than increase the delay.

k8s has terminationGracePeriodSeconds, which is 30 seconds by default and can be increased without a problem. ECS also has a 30-second grace period. 7 seconds is not that much on a 30-second scale. We only need to care about Docker Compose, which has a 10-second default grace timeout.
@fjl If I implement --shutdown-delay and make the delay opt-in, is there a chance this PR gets merged?

@fjl (Contributor)

fjl commented May 29, 2025

It is not so easy. The flag is a good step, and we will definitely not merge it without the flag. But I'm trying to explore alternative solutions also. For example, we might want to add an admin_ method for shutting down the node gracefully. I left the PR open to discuss possibilities here. We will triage this again next week as a team.

@eugene-uniswap (Author)

It is not so easy. The flag is a good step, and we will definitely not merge it without the flag. But I'm trying to explore alternative solutions also. For example, we might want to add an admin_ method for shutting down the node gracefully. I left the PR open to discuss possibilities here. We will triage this again next week as a team.

I'm concerned that this might be difficult to align with the container lifecycle. For example, when you update a container version, you typically revise the ECS task definition. ECS then begins the update by sending a SIGTERM signal to one of the running containers (we don't know which one will be chosen first). At that point, you have two options:

  1. Stop accepting requests immediately, which can cause a spike in errors at the downstream load balancer (the current scenario).

  2. Signal the load balancer to stop sending new traffic first, while continuing to serve existing requests until the load balancer reacts to a health check.

If the load balancer is integrated into the platform (like ALB or kube-proxy), the first option is generally manageable. However, when using something like proxyd, which is not platform-native, there is no built-in mechanism to signal it aside from the health check we're trying to implement.

There’s a reason SIGTERM and SIGKILL are spaced out by 30 seconds (by default): that window is intended to give time to resolve race conditions and ensure a clean shutdown, where the client stops sending requests first and the server stops accepting them only after that.


@fjl (Contributor)

fjl commented Jun 3, 2025

Yes, we will definitely consider merging this when it is behind a flag! But the flag should not be on by default.

@fjl removed the status:triage label Jun 3, 2025
@fjl assigned jwasinger and unassigned jwasinger Jun 3, 2025
@eugene-uniswap (Author)

eugene-uniswap commented Jun 9, 2025

Yes, we will definitely consider merging this when it is behind a flag! But the flag should not be on by default.

@fjl Added a commit making the delay optional and off by default. Did some testing (see the PR description).

@fjl self-assigned this Jun 17, 2025
@eugene-uniswap (Author)

@fjl Hi Felix! What is the final conclusion? Thanks!
