[consensus] Update proposer metrics #19655
Conversation
consensus/core/src/core.rs
Outdated
.add_blocks(accepted_blocks.iter().map(|b| b.reference()).collect())
// Get max round of accepted blocks. This will be equal to the threshold
// clock round, either by advancing the threshold clock round by being
// greater than current clock round or by equaling the current clock round.
Is this the case? Blocks older than the current threshold clock round can get accepted as well.
I only added the greater and equal cases in the comment, because blocks below the clock round are essentially ignored by the threshold clock.
consensus/core/src/core.rs
Outdated
self.context
    .metrics
    .node_metrics
    .block_proposal_leader_wait_count
I think we should use a separate metric for counting the number of times the leader is not found. block_proposal_leader_wait_count is tied to block_proposal_leader_wait_ms, so when the average wait is ~250ms, we know the leader is missing.
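The coupling described above can be sketched as follows. This is a minimal, hypothetical model of Prometheus-style sum/count semantics, not the actual metrics code: dashboards derive the average leader wait as `block_proposal_leader_wait_ms / block_proposal_leader_wait_count`, so the two series only stay meaningful if they are incremented together.

```rust
// Hypothetical sketch of paired sum/count metrics; field names mirror the
// real metric names for illustration only.
#[derive(Default)]
struct LeaderWaitMetric {
    wait_ms_sum: u64, // stands in for block_proposal_leader_wait_ms
    wait_count: u64,  // stands in for block_proposal_leader_wait_count
}

impl LeaderWaitMetric {
    // Record one proposal's leader wait; sum and count move in lockstep.
    fn observe(&mut self, wait_ms: u64) {
        self.wait_ms_sum += wait_ms;
        self.wait_count += 1;
    }

    // The derived average a dashboard would compute.
    fn average_ms(&self) -> f64 {
        if self.wait_count == 0 {
            0.0
        } else {
            self.wait_ms_sum as f64 / self.wait_count as f64
        }
    }
}

fn main() {
    let mut m = LeaderWaitMetric::default();
    // Three proposals that each waited the full leader timeout.
    for wait in [250u64, 250, 250] {
        m.observe(wait);
    }
    // An average near the leader timeout signals a consistently missing leader.
    println!("average leader wait: {}ms", m.average_ms());
}
```

Repurposing the count for a different event (such as "leader not found") breaks this ratio, which is the argument for a separate counter.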
The confusion for me with these metrics is that they don't just include leader wait time; they also include the quorum receive wait time, which can make the metric a little misleading. Separating them brings more clarity, though I guess we could always subtract this metric from the quorum receive latency.
Looking at ThresholdClock::add_block(), quorum_ts is updated when a quorum forms. Then block_proposal_leader_wait_ms is calculated from quorum_ts, so it does not include quorum receive latency. Also, the leader timeout task is signaled when quorum_ts is updated, so block_proposal_leader_wait_ms should line up with the actual leader wait.
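The timing relationship described above can be sketched like this. Struct and method names here are illustrative stand-ins, not the actual ThresholdClock API: the point is that quorum_ts is stamped when a quorum forms, so everything before that moment (the quorum receive latency) is excluded from the leader wait by construction.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the quorum timestamping described above.
struct ThresholdClockSketch {
    quorum_ts: Instant, // reset each time a quorum forms for a new round
}

impl ThresholdClockSketch {
    // Called when a quorum of blocks for the round has been accepted.
    // The real code also signals the leader timeout task at this point.
    fn on_quorum(&mut self) {
        self.quorum_ts = Instant::now();
    }

    // block_proposal_leader_wait_ms is derived at proposal time as the
    // elapsed time since the quorum formed, never a negative duration.
    fn leader_wait(&self, now: Instant) -> Duration {
        now.saturating_duration_since(self.quorum_ts)
    }
}

fn main() {
    let mut clock = ThresholdClockSketch { quorum_ts: Instant::now() };
    clock.on_quorum();
    // Measuring at (or before) quorum_ts yields zero leader wait.
    println!("{:?}", clock.leader_wait(clock.quorum_ts));
}
```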
What I was trying to measure is the specific moment when the leader block is received versus when the round can advance due to a quorum being received. The wait time is any time after the quorum is received, as that is the minimum needed to advance the round. The leader block can itself be received and complete the quorum needed to advance the clock, so to accurately get the wait time for a leader block we should take max(leader_received_ts - quorum_ts, 0).
What we currently have should in theory lead to the same results, but after we add the block we go through try_commit, which then adds to the wait time calculation for the leader. I think this is why we see the difference in the graphs. (pasting here again)
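The proposed measurement can be sketched with plain millisecond timestamps. The function name is hypothetical; the clamp to zero covers the case where the leader block arrives before, or as part of, the quorum itself.

```rust
// Hypothetical sketch of max(leader_received_ts - quorum_ts, 0) using
// unsigned saturating subtraction, which clamps at zero instead of going
// negative.
fn leader_wait_ms(leader_received_ts: u64, quorum_ts: u64) -> u64 {
    leader_received_ts.saturating_sub(quorum_ts)
}

fn main() {
    // Leader arrived 80ms after the quorum formed: 80ms of wait.
    println!("{}", leader_wait_ms(1_080, 1_000));
    // Leader block was part of the quorum itself: zero wait.
    println!("{}", leader_wait_ms(990, 1_000));
}
```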
If the average is calculated as block_proposal_leader_wait_ms / block_proposal_leader_wait_count, can you compare the block_proposal_leader_wait_count rate as well? Originally the intention was for the count to track the number of blocks proposed. With the new logic, I think it is double counting: if a block proposal has to wait for 250ms, and try_new_block() gets called after waiting for 50ms, 100ms, 150ms and 200ms, then the average leader wait seems to be skewed higher.
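The double-counting concern above can be made concrete with the numbers from the comment. This is an illustrative calculation, not the actual metrics code: one proposal with a true 250ms wait, re-recorded at each intermediate try_new_block() call, contributes 500ms of recorded wait against a single proposed block.

```rust
fn main() {
    // Wait-so-far recorded at each intermediate try_new_block() call.
    let checkpoints_ms: [u64; 4] = [50, 100, 150, 200];
    let recorded: u64 = checkpoints_ms.iter().sum();
    let true_wait_ms: u64 = 250;

    // 500ms recorded for one proposed block, versus the true 250ms wait,
    // so wait-per-proposal reads higher than it actually was.
    println!("recorded={}ms true={}ms", recorded, true_wait_ms);
    assert!(recorded > true_wait_ms);
}
```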
Yeah, you are correct: using the leader count in the new metric was incorrect. But comparing them, it looks like the wait is indeed still lower.
My suggestion is to keep the block_proposal_interval metric and postpone the refactor of block_proposal_leader_wait_ms and block_proposal_leader_wait_count. We can chat offline on the goal of the refactor and the best way to achieve it.
consensus/core/src/core.rs
&["authority"],
registry,
).unwrap(),
block_proposal_interval: register_histogram_with_registry!(
This metric is useful.
Ordering::Greater => {
    self.aggregator.clear();
    self.aggregator.add(block.author, &self.context.committee);
    if proposal_leaders_exist {
I feel that if we want to monitor the time between round N having a quorum and the round N leader existing, doing it in Core is more natural, and it is almost equivalent to how block_proposal_leader_wait_ms is computed.
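The alternative suggested above, measuring at proposal time in Core rather than inside ThresholdClock, might look roughly like this. All names here are illustrative assumptions, not the actual Core API:

```rust
use std::time::Instant;

// Hypothetical sketch: Core keeps the quorum timestamp and measures the
// quorum-to-leader gap when it decides whether to propose.
struct CoreSketch {
    quorum_ts: Instant, // stamped when the threshold clock saw a quorum
}

impl CoreSketch {
    // Called when Core considers proposing for the new round. Returns the
    // leader wait in ms if the leaders were found, mirroring how
    // block_proposal_leader_wait_ms is computed today.
    fn try_new_block(&self, proposal_leaders_exist: bool) -> Option<u128> {
        if proposal_leaders_exist {
            Some(self.quorum_ts.elapsed().as_millis())
        } else {
            // No leader yet: leave it to the leader timeout task.
            None
        }
    }
}

fn main() {
    let core = CoreSketch { quorum_ts: Instant::now() };
    println!("{:?}", core.try_new_block(true));
}
```

This keeps ThresholdClock focused on round advancement while Core, which already knows about leaders, owns the measurement.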
I agree on that. ThresholdClock is starting to pick up responsibilities that don't seem coherent or related, and Core is probably a better place for this.
Reverted the changes to the threshold clock and block_proposal_leader_wait_ms. Will keep what we have for now.
Force-pushed from 9ce12a0 to 3752116
Description
Add a metric for the interval between proposals.
Test plan
private-testnet
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.