[sui-node] post-fork startup and metrics #22862

johnjmartin · 2025-07-24T00:02:02Z

Description

Today when a fork happens, the sui-node process just crashloops, attempting to start up but forking and panicing repeatedly until the systemd unit or kube controller stops restarting it. This makes it hard for us to know if a node in the network that someone else is running has panicked. all we know is they have stopped pushing us metrics.

This change includes the following:

When a fork first happens we commit a fork marker to the CheckpointWatermark table, then panic
Adds a check for the fork marker to node startup, supporting a "crash mode" which emits metrics and logs the error that caused the crash, but does not continue crashing

Test plan

basic unit tests

local cluster tests TODO

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

vercel · 2025-07-24T00:02:12Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
sui-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jul 25, 2025 5:29pm

2 Skipped Deployments

Name	Status	Preview	Comments	Updated (UTC)
multisig-toolkit	⬜️ Ignored (Inspect)	Visit Preview		Jul 25, 2025 5:29pm
sui-kiosk	⬜️ Ignored (Inspect)	Visit Preview		Jul 25, 2025 5:29pm

alex-mysten · 2025-07-25T19:41:53Z

crates/sui-node/src/admin.rs

@@ -92,6 +92,7 @@ const RANDOMNESS_INJECT_FULL_SIG_ROUTE: &str = "/randomness-inject-full-sig";
 const GET_TX_COST_ROUTE: &str = "/get-tx-cost";
 const DUMP_CONSENSUS_TX_COST_ESTIMATES_ROUTE: &str = "/dump-consensus-tx-cost-estimates";
 const TRAFFIC_CONTROL: &str = "/traffic-control";
+const CLEAR_CHECKPOINT_FORK_ROUTE: &str = "/clear-checkpoint-fork";


Would it make sense to have "/repair-fork" which accepts instructions on how to fix the fork and clears the fork state?

I suppose with my current implementation for recovery operators would need to specify the overrides in YAML, restart the validator, then call /clear-checkpoint-fork.

alex-mysten · 2025-07-25T19:48:15Z

crates/sui-core/src/checkpoints/mod.rs

@@ -622,6 +622,15 @@ impl CheckpointStore {
                ?local_contents,
                "Local checkpoint fork detected!",
            );
+
+            // Record the fork in the database before crashing
+            if let Err(e) = self.record_checkpoint_fork_detected(


What happens when we detect a fork on transaction?

alex-mysten · 2025-07-25T20:26:10Z

crates/sui-node/src/lib.rs

+            // Wait for notification from admin API that fork has been cleared
+            let mut fork_cleared_rx = checkpoint_fork_cleared_notification.subscribe();
+            tokio::select! {
+                _ = fork_cleared_rx.recv() => {


Is there a circular dependency with starting the admin server here?

In main.rs, we await start_async to set node_once_cell, then we await node_once_cell to start the admin server. If this blocks, would the admin server ever be started?

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 00:02 — with GitHub Actions Inactive

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:15 — with GitHub Actions Inactive

better checkpoint fork handling & metrics

7be8253

johnjmartin force-pushed the amended-fork-startup branch from 5c0769a to 2ba32bc Compare July 25, 2025 17:16

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:16 — with GitHub Actions Inactive

vercel bot deployed to Preview – sui-docs July 25, 2025 17:19 View deployment

johnjmartin force-pushed the amended-fork-startup branch from 2ba32bc to 789e306 Compare July 25, 2025 17:24

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:24 — with GitHub Actions Inactive

johnjmartin force-pushed the amended-fork-startup branch from 789e306 to c0772de Compare July 25, 2025 17:25

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:25 — with GitHub Actions Inactive

pause node startup earlier on, add notification to admin clear

b0d34a2

johnjmartin force-pushed the amended-fork-startup branch from c0772de to b0d34a2 Compare July 25, 2025 17:27

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:27 — with GitHub Actions Inactive

vercel bot deployed to Preview – sui-docs July 25, 2025 17:29 View deployment

johnjmartin changed the title ~~better checkpoint fork handling & metrics~~ [sui-node] post-fork startup and metrics Jul 25, 2025

johnjmartin marked this pull request as ready for review July 25, 2025 18:26

johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 18:26 — with GitHub Actions Inactive

johnjmartin requested review from mystenmark and alex-mysten July 25, 2025 18:59

alex-mysten reviewed Jul 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[sui-node] post-fork startup and metrics #22862

[sui-node] post-fork startup and metrics #22862

johnjmartin commented Jul 24, 2025 •

edited

Loading

Uh oh!

vercel bot commented Jul 24, 2025 •

edited

Loading

Uh oh!

alex-mysten Jul 25, 2025

Uh oh!

alex-mysten Jul 25, 2025

Uh oh!

alex-mysten Jul 25, 2025

Uh oh!

Uh oh!

[sui-node] post-fork startup and metrics #22862

Are you sure you want to change the base?

[sui-node] post-fork startup and metrics #22862

Conversation

johnjmartin commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Test plan

Release notes

Uh oh!

vercel bot commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alex-mysten Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

alex-mysten Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

alex-mysten Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

johnjmartin commented Jul 24, 2025 •

edited

Loading

vercel bot commented Jul 24, 2025 •

edited

Loading