Skip to content

Conversation

mystenmark
Copy link
Contributor

This change adds a replay-on-startup phase to checkpoint service construction.

The basic idea is that with data quarantining, if the validator crashes, it will reprocess consensus commits that had already been processed, but whose output had not yet been committed to disk. Constructed checkpoints are one of the outputs of this process.

When the node starts up, it should be in the same state it was in when it crashed. In order to do this, we need to replay all the consensus commits that had been previously processed. This will cause checkpoints to be constructed. The effect of this change is that the node will wait until these checkpoints have been reconstructed before entering normal operations.

@mystenmark mystenmark requested a review from aschran January 9, 2025 18:52
Copy link

vercel bot commented Jan 9, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 23, 2025 7:06pm
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Jan 23, 2025 7:06pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Jan 23, 2025 7:06pm

Copy link
Contributor

@aschran aschran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was really confused on my first read-through of mod.rs, and then once I mentally remapped some variable names it made a lot more sense. looks good overall but just some comments on naming

.await?;
sorted_tx_effects_included_in_checkpoint.extend(txn_in_checkpoint);
}
let new_checkpoint = self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit- while you're in the neighborhood, maybe fix this name to new_checkpoints?

}

impl CheckpointServiceState {
fn take(&mut self) -> (CheckpointBuilder, CheckpointAggregator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have a more desciptive name for this than take? I think the behavior (about marking it started) is surprising given the fn name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough - i named it take because I was just making a more readable version of an Option. renamed to take_unstarted

// The highest sequence number that had already been built at the time CheckpointService
// was constructed
last_built_seq: CheckpointSequenceNumber,
// A notification for each time a new checkpoint is built.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is confusing me because you are using a watch, which I don't believe is guaranteed to report every sent value to a receiver if the receiver is too slow? (but "each time" makes it seem like it is important for every value to be received?) or am I wrong about how watch works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Reworded the comment.

{
break;
} else {
debug_fatal!("Still waiting for checkpoints to be rebuilt");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it helpful to crash here even if just in debug mode? no invariant has been violated, it seems like it should be a warn! at most?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is - while testing DQ this catches various cases where the rebuilding process got stuck.

}

impl CheckpointService {
/// Waits until the last_built_seq available in last_built_rx is >= last_built_seq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is like greek to me, as far as I can tell there isn't anything called last_built_rx at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this comment was more of a copilot prompt. fixed.

last_signature_index: Mutex<u64>,
// The highest sequence number that had already been built at the time CheckpointService
// was constructed
last_built_seq: CheckpointSequenceNumber,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat confused by this name - it isn't the "last" built sequence number, right? it's the highest-ever built sequence number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

// was constructed
last_built_seq: CheckpointSequenceNumber,
// A notification for each time a new checkpoint is built.
cur_built_tx: watch::Sender<CheckpointSequenceNumber>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if my comment about last_built_seq above is correct, then it seems like this should be called last_built_tx? (and the above should be highest_built_seq or something like that?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

@mystenmark mystenmark temporarily deployed to sui-typescript-aws-kms-test-env January 23, 2025 19:04 — with GitHub Actions Inactive
@mystenmark mystenmark enabled auto-merge (squash) January 23, 2025 19:10
@mystenmark mystenmark merged commit 75ea5e3 into main Jan 23, 2025
51 checks passed
@mystenmark mystenmark deleted the mlogan-checkpoint-refactor branch January 23, 2025 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants