Skip to content

[consensus] only wait to replay until highest consumer processed commit #22854

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 25, 2025

Conversation

mwtian
Copy link
Contributor

@mwtian mwtian commented Jul 23, 2025

Description

Currently, CommitConsumerMonitor::replay_complete() waits until all local commits found during initialization gets processed through handle_consensus_commit(). But a suffix of local commits may not have been finalized before. These commits can only be finalized with new commits, after consensus starts running. So the waiter can wait for a long time.

This change makes the waiter only wait until the last commit index handle_consensus_commit() finished processing, which is guaranteed to be finalized in local DAG. Then the waiter should not wait for too long on restart.

Also, record finalized commit refs and their rejected transactions. They will be used during recovery to identify the commits which have enough blocks in local DAG to finalize them. In future, they will be used to verify rejected transactions are computed correctly during recovery.

Test plan

CI

TRANSACTION_DRIVER=50 python ./scripts/simtest/seed-search.py test_simulated_load_reconfig_with_crashes_and_delays --test simtest --num-seeds 500

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • gRPC:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:

Copy link

vercel bot commented Jul 23, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 25, 2025 5:26am
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Jul 25, 2025 5:26am
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Jul 25, 2025 5:26am

@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 23, 2025 17:09 — with GitHub Actions Inactive
@mwtian mwtian force-pushed the tmw/consensus-replay branch from b51d012 to ab47b1a Compare July 24, 2025 02:24
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 02:24 — with GitHub Actions Inactive
@mwtian mwtian force-pushed the tmw/consensus-replay branch from ab47b1a to 920a29a Compare July 24, 2025 04:13
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 04:13 — with GitHub Actions Inactive
@mwtian mwtian force-pushed the tmw/consensus-replay branch from 920a29a to 84af149 Compare July 24, 2025 04:15
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 04:15 — with GitHub Actions Inactive
@mwtian mwtian force-pushed the tmw/consensus-replay branch from 84af149 to f7ca8b6 Compare July 24, 2025 04:45
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 04:45 — with GitHub Actions Inactive
@mwtian mwtian marked this pull request as ready for review July 24, 2025 04:45
@mwtian mwtian requested a review from a team as a code owner July 24, 2025 04:45
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 04:45 — with GitHub Actions Inactive
@mwtian mwtian force-pushed the tmw/consensus-replay branch from f7ca8b6 to 104aa61 Compare July 24, 2025 04:59
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 24, 2025 04:59 — with GitHub Actions Inactive
@mwtian mwtian changed the title [consensus] only wait to replay until highest consumer observed commit [consensus] only wait to replay until highest consumer processed commit Jul 24, 2025
@mwtian mwtian requested a review from lxfind July 24, 2025 18:46
@mwtian mwtian force-pushed the tmw/consensus-replay branch from 104aa61 to 9b91e75 Compare July 25, 2025 01:45
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 01:45 — with GitHub Actions Inactive
@mwtian mwtian force-pushed the tmw/consensus-replay branch from 9b91e75 to 6fe223c Compare July 25, 2025 02:04
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 02:04 — with GitHub Actions Inactive
if committed_sub_dag.commit_ref.index
> commit_consumer.consumer_last_processed_commit_index
{
committed_sub_dag.local_dag_has_finalization_blocks = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to mitigate incorrectly finalize transactions in recovered commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this fixes the issue? We only persist finalized commits to store so recovering finalized commits should be able to be reprocessed correctly. What information is missing to be able to properly finalize transactions in recovered commits?

if !finalized_commits.is_empty() {
// Commits and committed blocks must be persisted to storage before sending them to Sui
// to execute their finalized transactions.
// Commit metadata and uncommitted blocks can be persisted more lazily because they are recoverable.
// But for simplicity, all unpersisted commits and blocks are flushed to storage.
self.dag_state.write().flush();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted offline. Today for simplicity both finalized and non-finalized commits are persisted. But a suffix of commits may not have been finalized, and they don't have enough info in local dag to optimistically finalize transactions.

@lxfind lxfind requested a review from mystenmark July 25, 2025 02:14
Copy link
Contributor

@arun-koshy arun-koshy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock fixing the original issue of reducing delay in rebuilding checkpoints which looks good, but I still have an open question on how this fixes incorrect finalization

if committed_sub_dag.commit_ref.index
> commit_consumer.consumer_last_processed_commit_index
{
committed_sub_dag.local_dag_has_finalization_blocks = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this fixes the issue? We only persist finalized commits to store so recovering finalized commits should be able to be reprocessed correctly. What information is missing to be able to properly finalize transactions in recovered commits?

if !finalized_commits.is_empty() {
// Commits and committed blocks must be persisted to storage before sending them to Sui
// to execute their finalized transactions.
// Commit metadata and uncommitted blocks can be persisted more lazily because they are recoverable.
// But for simplicity, all unpersisted commits and blocks are flushed to storage.
self.dag_state.write().flush();
}

@mwtian mwtian force-pushed the tmw/consensus-replay branch from 710a9d9 to d2b91e7 Compare July 25, 2025 05:20
@mwtian mwtian temporarily deployed to sui-typescript-aws-kms-test-env July 25, 2025 05:20 — with GitHub Actions Inactive
@mwtian mwtian merged commit bb47ac8 into main Jul 25, 2025
56 checks passed
@mwtian mwtian deleted the tmw/consensus-replay branch July 25, 2025 06:27
Copy link
Contributor

@akichidis akichidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

/// and is false when the commit is received through commit sync.
/// When this is false, CommitFinalizer will not assume there are enough blocks
/// to optimistically finalize transactions in the commit.
pub local_dag_has_finalization_blocks: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe this could be something more directly with how this sub dag has been committed? Ex:

  • committed_with_local_certificate
  • commit_certificate_exists_locally
  • leader_certificate_exists_locally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good ideas.

// & last_committed_rounds from the commits as needed.
commit_info_to_write: Vec<(CommitRef, CommitInfo)>,

// Buffers finalized commits and their rejected transactions to be written to storage.
finalized_commits_to_write: Vec<(CommitRef, BTreeMap<BlockRef, Vec<TransactionIndex>>)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend to use somewhere the rejected transaction indexes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me follow up with verification. Also they can be used for debugging if needed.

@@ -154,10 +154,10 @@ impl ConsensusManagerTrait for MysticetiManager {

let num_prior_commits = protocol_config.consensus_num_requested_prior_commits_at_startup();
let last_processed_commit = consensus_handler.last_processed_subdag_index() as CommitIndex;
let starting_commit = last_processed_commit.saturating_sub(num_prior_commits);
let restart_after_commit = last_processed_commit.saturating_sub(num_prior_commits);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: restart_after_commit -> replay_after_commit ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants