Skip to content

Conversation

arun-koshy
Copy link
Contributor

@arun-koshy arun-koshy commented Oct 2, 2024

Description

Add metric for the interval between propsals.

Test plan

private-testnet


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

@arun-koshy arun-koshy requested review from akichidis and mwtian October 2, 2024 06:38
Copy link

vercel bot commented Oct 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 22, 2024 11:04pm
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Oct 22, 2024 11:04pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Oct 22, 2024 11:04pm
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Oct 22, 2024 11:04pm

@arun-koshy
Copy link
Contributor Author

arun-koshy commented Oct 2, 2024

All of the changes to the metrics can be seen reflected on the left side of the graphs and the right side is "main". Let me know what you think.

  • We can see the rate at which we have to call back in to try new block because leaders were missing.
    Screenshot 2024-10-02 at 10 55 09 AM
  • We can see the the average wait time for a leader AFTER a quorum has been reached which is around 3ms. With this we will have the quorum receive latency + leader wait time separated to show us which is taking most of the time.
    Screenshot 2024-10-02 at 10 55 37 AM
  • And this is what block proposal interval will look like.
    Screenshot 2024-10-02 at 10 55 53 AM

.add_blocks(accepted_blocks.iter().map(|b| b.reference()).collect())
// Get max round of accepted blocks. This will be equal to the threshold
// clock round, either by advancing the threshold clock round by being
// greater than current clock round or by equaling the current clock round.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the case? Blocks older than current threshold clock round can get accepted as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added the case in the comment for greater and equal but blocks less than the clock round are essentially ignored by threshold clock

self.context
.metrics
.node_metrics
.block_proposal_leader_wait_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use a separate metric for counting the number of times leader is not found. block_proposal_leader_wait_count is tied to block_proposal_leader_wait_ms, so when the average wait is ~250ms, we know the leader is missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the confusion for me with these metrics is that it doesn't just include leader wait time, it includes the quorum receive wait time which can make this metric a little misleading. Separating them brings more clarity. Though I guess we could always subtract this metric from quorum receive latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at ThresholdClock::add_block(), quorum_ts is updated when a quorum forms. Then the block_proposal_leader_wait_ms is calculated from quorum_ts. This does not include quorum receive latency. Also, leader timeout task is signaled when quorum_ts is updated, so block_proposal_leader_wait_ms should line up with actual leader wait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was trying to measure is the specific moment when the leader block is received versus when the round can advance due to quorum received. The wait time is anytime after the quorum is received as that is the minimum needed to advance the round. The leader block can be received and make up the quorum needed to advance the clock. So to accurately get the wait time for a leader block we should get the min(leader_received_ts - quorum_ts, 0).

What we currently have in theory should lead to the same results but after we add the block we go through the process of try_commit which then adds to the wait time calculation for the leader. I think this is why we see the difference in the graphs. (pasting here again)

[left is the new metric and right is main]
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the average is calculated with block_proposal_leader_wait_ms / block_proposal_leader_wait_count, can you compare the block_proposal_leader_wait_count rate as well? Originally the intention is to have the count similar to # of blocks proposed. With the new logic, I think it is double counting: if a block proposal has to wait for 250ms, and try_new_block() gets called after waiting for 50ms, 100ms, 150ms and 200ms, then the average leader wait seems to be skewed higher.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you are correct using leader count was incorrect in the new metric but comparing it looks like the wait is indeed lower still

https://metrics.sui.io/goto/mZanqCiHg?orgId=1

Copy link
Contributor

@mwtian mwtian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to keep the block_proposal_interval metric and postpone the refactor to block_proposal_leader_wait_ms and block_proposal_leader_wait_count. We can chat offline on the goal of the refactor and the best way to achieve that.

self.context
.metrics
.node_metrics
.block_proposal_leader_wait_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at ThresholdClock::add_block(), quorum_ts is updated when a quorum forms. Then the block_proposal_leader_wait_ms is calculated from quorum_ts. This does not include quorum receive latency. Also, leader timeout task is signaled when quorum_ts is updated, so block_proposal_leader_wait_ms should line up with actual leader wait.

&["authority"],
registry,
).unwrap(),
block_proposal_interval: register_histogram_with_registry!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is useful.

Ordering::Greater => {
self.aggregator.clear();
self.aggregator.add(block.author, &self.context.committee);
if proposal_leaders_exist {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel if we want to monitor the time between round N having quorum and round N leader exists, doing it in core is more natural, and it is almost equivalent to how block_proposal_leader_wait_ms is computed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on that. Thresholdclock starts now having some responsibility which doesn't seem that coherent/related and probably Core might be a better place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the changes to threshold clock and block_proposal_leader_wait_ms. Will keep what we have for now

@arun-koshy arun-koshy merged commit cb0a21a into main Oct 23, 2024
52 checks passed
@arun-koshy arun-koshy deleted the ak/proposer-metrics branch October 23, 2024 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants