Skip to content

[sui-node] post-fork startup and metrics #22862

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

johnjmartin
Copy link
Contributor

@johnjmartin johnjmartin commented Jul 24, 2025

Description

Today when a fork happens, the sui-node process just crashloops, attempting to start up but forking and panicing repeatedly until the systemd unit or kube controller stops restarting it. This makes it hard for us to know if a node in the network that someone else is running has panicked. all we know is they have stopped pushing us metrics.

This change includes the following:

  • When a fork first happens we commit a fork marker to the CheckpointWatermark table, then panic
  • Adds a check for the fork marker to node startup, supporting a "crash mode" which emits metrics and logs the error that caused the crash, but does not continue crashing

Test plan

basic unit tests

local cluster tests TODO


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 00:02 — with GitHub Actions Inactive
Copy link

vercel bot commented Jul 24, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 25, 2025 5:29pm
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Jul 25, 2025 5:29pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Jul 25, 2025 5:29pm

@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:15 — with GitHub Actions Inactive
@johnjmartin johnjmartin force-pushed the amended-fork-startup branch from 5c0769a to 2ba32bc Compare July 25, 2025 17:16
@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:16 — with GitHub Actions Inactive
@johnjmartin johnjmartin force-pushed the amended-fork-startup branch from 2ba32bc to 789e306 Compare July 25, 2025 17:24
@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:24 — with GitHub Actions Inactive
@johnjmartin johnjmartin force-pushed the amended-fork-startup branch from 789e306 to c0772de Compare July 25, 2025 17:25
@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:25 — with GitHub Actions Inactive
@johnjmartin johnjmartin force-pushed the amended-fork-startup branch from c0772de to b0d34a2 Compare July 25, 2025 17:27
@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 17:27 — with GitHub Actions Inactive
@johnjmartin johnjmartin changed the title better checkpoint fork handling & metrics [sui-node] post-fork startup and metrics Jul 25, 2025
@johnjmartin johnjmartin marked this pull request as ready for review July 25, 2025 18:26
@johnjmartin johnjmartin temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 18:26 — with GitHub Actions Inactive
@@ -92,6 +92,7 @@ const RANDOMNESS_INJECT_FULL_SIG_ROUTE: &str = "/randomness-inject-full-sig";
const GET_TX_COST_ROUTE: &str = "/get-tx-cost";
const DUMP_CONSENSUS_TX_COST_ESTIMATES_ROUTE: &str = "/dump-consensus-tx-cost-estimates";
const TRAFFIC_CONTROL: &str = "/traffic-control";
const CLEAR_CHECKPOINT_FORK_ROUTE: &str = "/clear-checkpoint-fork";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have "/repair-fork" which accepts instructions on how to fix the fork and clears the fork state?

I suppose with my current implementation for recovery operators would need to specify the overrides in YAML, restart the validator, then call /clear-checkpoint-fork.

@@ -622,6 +622,15 @@ impl CheckpointStore {
?local_contents,
"Local checkpoint fork detected!",
);

// Record the fork in the database before crashing
if let Err(e) = self.record_checkpoint_fork_detected(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when we detect a fork on transaction?

// Wait for notification from admin API that fork has been cleared
let mut fork_cleared_rx = checkpoint_fork_cleared_notification.subscribe();
tokio::select! {
_ = fork_cleared_rx.recv() => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a circular dependency with starting the admin server here?

In main.rs, we await start_async to set node_once_cell, then we await node_once_cell to start the admin server. If this blocks, would the admin server ever be started?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants