[consensus] Update proposer metrics #19655

arun-koshy · 2024-10-02T06:38:20Z

Description

Add a metric for the interval between proposals.

Test plan

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

arun-koshy · 2024-10-02T18:01:06Z

All of the metric changes are reflected on the left side of the graphs; the right side is "main". Let me know what you think.

We can see the rate at which we have to call back into try_new_block() because leaders were missing.
We can see the average wait time for a leader AFTER a quorum has been reached, which is around 3ms. With this, quorum receive latency and leader wait time are separated, which shows us which one takes most of the time.
And this is what the block proposal interval will look like.

mwtian · 2024-10-02T18:25:02Z

consensus/core/src/core.rs

-            .add_blocks(accepted_blocks.iter().map(|b| b.reference()).collect())
+        // Get max round of accepted blocks. This will be equal to the threshold
+        // clock round, either by advancing the threshold clock round by being
+        // greater than current clock round or by equaling the current clock round.


Is this the case? Blocks older than current threshold clock round can get accepted as well.

The comment only covers the greater-than and equal cases; blocks from rounds below the clock round are essentially ignored by the threshold clock.
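The three cases discussed here (block round below, equal to, or above the clock round) can be sketched as follows. This is a simplified illustration, not the code in threshold_clock.rs: the struct name `SketchThresholdClock`, the equal-stake committee, and the `HashSet` standing in for the stake aggregator are all assumptions made for the example.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Hypothetical, simplified threshold clock. Every authority is assumed
/// to hold equal stake, so a quorum is just a count of distinct authors.
pub struct SketchThresholdClock {
    round: u32,
    seen: HashSet<u32>, // authors with an accepted block at `round`
    committee_size: usize,
}

impl SketchThresholdClock {
    pub fn new(committee_size: usize) -> Self {
        Self { round: 0, seen: HashSet::new(), committee_size }
    }

    /// 2f+1 out of n = 3f+1 authorities (equal-stake assumption).
    fn quorum_threshold(&self) -> usize {
        2 * self.committee_size / 3 + 1
    }

    pub fn round(&self) -> u32 {
        self.round
    }

    /// Adds an accepted block and returns true if the clock round advanced.
    pub fn add_block(&mut self, author: u32, block_round: u32) -> bool {
        let before = self.round;
        match block_round.cmp(&self.round) {
            // Blocks from older rounds do not affect the clock.
            Ordering::Less => {}
            // A block from a newer round moves the clock to that round.
            Ordering::Greater => {
                self.seen.clear();
                self.seen.insert(author);
                self.round = block_round;
            }
            // A block at the current round counts toward the quorum.
            Ordering::Equal => {
                self.seen.insert(author);
            }
        }
        // Once a quorum of authors is seen at this round, advance by one.
        if self.seen.len() >= self.quorum_threshold() {
            self.seen.clear();
            self.round += 1;
        }
        self.round > before
    }
}
```

With a committee of four, three distinct authors at round 0 advance the clock to round 1, and a late round-0 block from the fourth author leaves the clock unchanged.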

mwtian · 2024-10-02T18:31:23Z

consensus/core/src/core.rs

+                self.context
+                    .metrics
+                    .node_metrics
+                    .block_proposal_leader_wait_count


I think we should use a separate metric for counting the number of times leader is not found. block_proposal_leader_wait_count is tied to block_proposal_leader_wait_ms, so when the average wait is ~250ms, we know the leader is missing.

What confuses me about these metrics is that they don't just include leader wait time; they also include the quorum receive wait time, which can make them a little misleading. Separating the two brings more clarity. Though I guess we could always subtract this metric from quorum receive latency.

Looking at ThresholdClock::add_block(), quorum_ts is updated when a quorum forms. Then the block_proposal_leader_wait_ms is calculated from quorum_ts. This does not include quorum receive latency. Also, leader timeout task is signaled when quorum_ts is updated, so block_proposal_leader_wait_ms should line up with actual leader wait.

What I was trying to measure is the specific moment when the leader block is received versus when the round can advance because a quorum was received. The wait time is any time after the quorum is received, as that is the minimum needed to advance the round. The leader block can itself be received as part of the quorum needed to advance the clock. So to accurately get the wait time for a leader block we should take max(leader_received_ts - quorum_ts, 0).
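A minimal sketch of that calculation, clamping the difference at zero so that a leader block arriving before (or as part of) the quorum counts as no wait. The function name `leader_wait` is hypothetical; only `quorum_ts` and `leader_received_ts` come from the discussion, and representing them as `std::time::Instant` is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: time spent waiting for the leader block after the
/// quorum formed. `saturating_duration_since` returns zero when the leader
/// block was received before `quorum_ts`, which gives the clamp at zero.
pub fn leader_wait(quorum_ts: Instant, leader_received_ts: Instant) -> Duration {
    leader_received_ts.saturating_duration_since(quorum_ts)
}
```

For example, a leader block received 3ms after the quorum yields a 3ms wait, while one received 3ms before the quorum yields zero.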

What we currently have should, in theory, lead to the same results. But after we add the block we go through try_commit, and that time is then added to the leader wait calculation. I think this is why we see the difference in the graphs. (pasting here again)

[left is the new metric and right is main]

If the average is calculated as block_proposal_leader_wait_ms / block_proposal_leader_wait_count, can you compare the block_proposal_leader_wait_count rate as well? Originally the intention was for the count to be similar to the number of blocks proposed. With the new logic, I think it is double counting: if a block proposal has to wait for 250ms, and try_new_block() gets called after waiting for 50ms, 100ms, 150ms and 200ms, then the average leader wait seems to be skewed higher.
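To make the double-counting concern concrete, here is a small worked example. It assumes, purely for illustration, that the new logic records the elapsed time since quorum on every try_new_block() call, whereas the original logic records one sample per proposed block. The helper names and the sample data are hypothetical.

```rust
/// Average of the samples in milliseconds; 0.0 for an empty slice.
pub fn average_ms(samples: &[u64]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().sum::<u64>() as f64 / samples.len() as f64
}

/// One sample per proposal: the elapsed time at the final call, when the
/// block was actually proposed. Each inner Vec lists the elapsed time (ms)
/// at each try_new_block() call for one proposal.
pub fn per_proposal_samples(proposals: &[Vec<u64>]) -> Vec<u64> {
    proposals
        .iter()
        .filter_map(|calls| calls.last().copied())
        .collect()
}

/// One sample per call: every intermediate retry is recorded as well.
pub fn per_call_samples(proposals: &[Vec<u64>]) -> Vec<u64> {
    proposals.iter().flatten().copied().collect()
}
```

Take ten proposals: nine find the leader immediately (a single call at 0ms) and one waits 250ms with retries at 50ms, 100ms, 150ms and 200ms. Per proposal, the average is 250 / 10 = 25ms. Per call, the sum is 750 over 14 samples, roughly 53.6ms, so the single slow proposal is weighted five times.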

Yeah, you are correct: using the leader wait count was incorrect in the new metric. Comparing the two, though, it looks like the wait is indeed still lower.

https://metrics.sui.io/goto/mZanqCiHg?orgId=1

mwtian

My suggestion is to keep the block_proposal_interval metric and postpone the refactor of block_proposal_leader_wait_ms and block_proposal_leader_wait_count. We can chat offline about the goal of the refactor and the best way to achieve it.

mwtian · 2024-10-22T04:13:00Z

consensus/core/src/metrics.rs

                &["authority"],
                registry,
            ).unwrap(),
+            block_proposal_interval: register_histogram_with_registry!(


This metric is useful.
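As a rough sketch of what a proposal-interval measurement involves: remember when the previous block was proposed and report the gap on each new proposal. The `ProposalIntervalTracker` type below is hypothetical and uses only the standard library; in the PR the value is reported through the block_proposal_interval histogram, and how that observation is made is not shown here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for the time between consecutive block proposals.
pub struct ProposalIntervalTracker {
    last_proposed: Option<Instant>,
}

impl ProposalIntervalTracker {
    pub fn new() -> Self {
        Self { last_proposed: None }
    }

    /// Records a proposal at `now` and returns the interval since the
    /// previous proposal, or None for the very first proposal.
    pub fn on_proposal(&mut self, now: Instant) -> Option<Duration> {
        let interval = self
            .last_proposed
            .map(|prev| now.saturating_duration_since(prev));
        self.last_proposed = Some(now);
        interval
    }
}
```

The returned `Duration` is what a caller would feed into a histogram, skipping the first proposal because it has no predecessor.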

mwtian · 2024-10-22T04:16:56Z

consensus/core/src/threshold_clock.rs

            Ordering::Greater => {
                self.aggregator.clear();
                self.aggregator.add(block.author, &self.context.committee);
+                if proposal_leaders_exist {


I feel that if we want to monitor the time between round N reaching quorum and the round N leader existing, doing it in core is more natural, and it is almost equivalent to how block_proposal_leader_wait_ms is computed.

I agree with that. ThresholdClock would start taking on a responsibility that doesn't seem coherent with or related to its purpose, and Core is probably a better place.

Reverted the changes to threshold clock and block_proposal_leader_wait_ms. Will keep what we have for now.

add proposer metrics

75e9363

arun-koshy requested review from akichidis and mwtian October 2, 2024 06:38

mwtian reviewed Oct 2, 2024

arun-koshy mentioned this pull request Oct 7, 2024

[consensus] Exclude low scoring ancestors in proposals with time based inclusion/exclusion #19605

Merged

8 tasks

mwtian self-requested a review October 21, 2024 16:55

mwtian reviewed Oct 22, 2024

address review comments

3752116

arun-koshy force-pushed the ak/proposer-metrics branch from 9ce12a0 to 3752116 Compare October 22, 2024 23:03

mwtian approved these changes Oct 23, 2024

arun-koshy merged commit cb0a21a into main Oct 23, 2024
52 checks passed

arun-koshy deleted the ak/proposer-metrics branch October 23, 2024 04:51
