-
Notifications
You must be signed in to change notification settings - Fork 11.6k
[sui-node] post-fork startup and metrics #22862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
2 Skipped Deployments
|
5c0769a
to
2ba32bc
Compare
2ba32bc
to
789e306
Compare
789e306
to
c0772de
Compare
c0772de
to
b0d34a2
Compare
@@ -92,6 +92,7 @@ const RANDOMNESS_INJECT_FULL_SIG_ROUTE: &str = "/randomness-inject-full-sig"; | |||
const GET_TX_COST_ROUTE: &str = "/get-tx-cost"; | |||
const DUMP_CONSENSUS_TX_COST_ESTIMATES_ROUTE: &str = "/dump-consensus-tx-cost-estimates"; | |||
const TRAFFIC_CONTROL: &str = "/traffic-control"; | |||
const CLEAR_CHECKPOINT_FORK_ROUTE: &str = "/clear-checkpoint-fork"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to have "/repair-fork" which accepts instructions on how to fix the fork and clears the fork state?
I suppose with my current implementation for recovery operators would need to specify the overrides in YAML, restart the validator, then call /clear-checkpoint-fork.
@@ -622,6 +622,15 @@ impl CheckpointStore { | |||
?local_contents, | |||
"Local checkpoint fork detected!", | |||
); | |||
|
|||
// Record the fork in the database before crashing | |||
if let Err(e) = self.record_checkpoint_fork_detected( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when we detect a fork on transaction?
// Wait for notification from admin API that fork has been cleared | ||
let mut fork_cleared_rx = checkpoint_fork_cleared_notification.subscribe(); | ||
tokio::select! { | ||
_ = fork_cleared_rx.recv() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a circular dependency with starting the admin server here?
In main.rs, we await start_async
to set node_once_cell
, then we await node_once_cell
to start the admin server. If this blocks, would the admin server ever be started?
Description
Today when a fork happens, the sui-node process just crashloops, attempting to start up but forking and panicing repeatedly until the systemd unit or kube controller stops restarting it. This makes it hard for us to know if a node in the network that someone else is running has panicked. all we know is they have stopped pushing us metrics.
This change includes the following:
Test plan
basic unit tests
local cluster tests TODO
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.