-
Notifications
You must be signed in to change notification settings - Fork 11.6k
CheckpointService refactoring (for data quarantining) #20838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
2 Skipped Deployments
|
06ff953
to
ffc5781
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was really confused on my first read-through of mod.rs
, and then once I mentally remapped some variable names it made a lot more sense. looks good overall but just some comments on naming
.await?; | ||
sorted_tx_effects_included_in_checkpoint.extend(txn_in_checkpoint); | ||
} | ||
let new_checkpoint = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional nit- while you're in the neighborhood, maybe fix this name to new_checkpoints
?
} | ||
|
||
impl CheckpointServiceState { | ||
fn take(&mut self) -> (CheckpointBuilder, CheckpointAggregator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have a more desciptive name for this than take
? I think the behavior (about marking it started) is surprising given the fn name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough - i named it take because I was just making a more readable version of an Option. renamed to take_unstarted
// The highest sequence number that had already been built at the time CheckpointService | ||
// was constructed | ||
last_built_seq: CheckpointSequenceNumber, | ||
// A notification for each time a new checkpoint is built. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is confusing me because you are using a watch
, which I don't believe is guaranteed to report every sent value to a receiver if the receiver is too slow? (but "each time" makes it seem like it is important for every value to be received?) or am I wrong about how watch works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Reworded the comment.
{ | ||
break; | ||
} else { | ||
debug_fatal!("Still waiting for checkpoints to be rebuilt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it helpful to crash here even if just in debug mode? no invariant has been violated, it seems like it should be a warn!
at most?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it is - while testing DQ this catches various cases where the rebuilding process got stuck.
} | ||
|
||
impl CheckpointService { | ||
/// Waits until the last_built_seq available in last_built_rx is >= last_built_seq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is like greek to me, as far as I can tell there isn't anything called last_built_rx
at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this comment was more of a copilot prompt. fixed.
last_signature_index: Mutex<u64>, | ||
// The highest sequence number that had already been built at the time CheckpointService | ||
// was constructed | ||
last_built_seq: CheckpointSequenceNumber, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat confused by this name - it isn't the "last" built sequence number, right? it's the highest-ever built sequence number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
// was constructed | ||
last_built_seq: CheckpointSequenceNumber, | ||
// A notification for each time a new checkpoint is built. | ||
cur_built_tx: watch::Sender<CheckpointSequenceNumber>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if my comment about last_built_seq
above is correct, then it seems like this should be called last_built_tx
? (and the above should be highest_built_seq
or something like that?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
ffc5781
to
f18192c
Compare
This change adds a replay-on-startup phase to checkpoint service construction.
The basic idea is that with data quarantining, if the validator crashes, it will reprocess consensus commits that had already been processed, but whose output had not yet been committed to disk. Constructed checkpoints are one of the outputs of this process.
When the node starts up, it should be in the same state it was in when it crashed. In order to do this, we need to replay all the consensus commits that had been previously processed. This will cause checkpoints to be constructed. The effect of this change is that the node will wait until these checkpoints have been reconstructed before entering normal operations.