Optimize cluster failure report #2277
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

Coverage diff against `unstable` (#2277):

| | unstable | #2277 | +/- |
| --- | --- | --- | --- |
| Coverage | 71.41% | 71.29% | -0.13% |
| Files | 123 | 123 | |
| Lines | 67092 | 67160 | +68 |
| Hits | 47913 | 47879 | -34 |
| Misses | 19179 | 19281 | +102 |
Force-pushed from bb02490 to 3e10dcc.
High-level cursory thought:
If we plan to introduce a linked list + hashtable combination to reduce the time complexity, I would prefer it to be an independent set of APIs to consume. Currently, it looks quite easy to introduce bugs and difficult to unit test.
Took an initial look at the approach. It's nice to see the improvement in time complexity but I am unsure how to test it.
@sungming2 I was alluding to introducing a separate data structure file, let's say lru_cache.h/lru_cache.c, and using it from methods in the cluster file. That way we could add unit tests for the add/remove functionality of lru_cache.c/lru_cache.h. The API(s) could be: lruNew, lruPut, lruGet, lruDelete, lruFree.
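If that route were taken, the header could look roughly like the sketch below. Only the function names come from the suggestion above; the opaque struct, signatures, and ownership semantics are illustrative assumptions, not actual PR code.

```c
/* lru_cache.h -- hypothetical sketch of the proposed interface, not PR code. */
#ifndef LRU_CACHE_H
#define LRU_CACHE_H

#include <stddef.h>

/* Opaque handle: internally a doubly linked list (recency order) plus a
 * hashtable (O(1) key lookup). */
typedef struct lruCache lruCache;

lruCache *lruNew(size_t capacity);                                   /* create, fixed capacity */
int lruPut(lruCache *c, const void *key, size_t klen, void *value);  /* insert/update, mark MRU */
void *lruGet(lruCache *c, const void *key, size_t klen);             /* lookup, mark MRU; NULL if absent */
int lruDelete(lruCache *c, const void *key, size_t klen);            /* remove; 1 if found */
void lruFree(lruCache *c);                                           /* free cache and all entries */

#endif /* LRU_CACHE_H */
```

Keeping it as its own translation unit would make the add/remove paths unit-testable in isolation, as suggested above.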
@madolson / @sarthakaggarwal97 do you agree with this approach? I agree with @sungming2's approach regarding the usage of this data structure. I couldn't think of any other alternative to address the high CPU utilization.
Yeah, the data structures look fine to me. I had this idea where we use the expiry timestamp as the key in a dictionary, the values could be the list of nodes, and we evict based on the current time. But I haven't put a lot of thought into whether this can address the problem.
Took a look at the APIs. I might have to think more about whether we can break something with these APIs, but I dropped a few comments for you till then. Thanks for following up on this.
Signed-off-by: Seungmin Lee <sungming@amazon.com>
@sungming2 is currently evaluating if vset would do the job. We will get back soon on this.
There have been multiple solutions discussed for improving failure report handling, each with different trade-offs in terms of complexity, maintainability, and performance.
In short, I think the rax-only approach is a minimal, targeted fix that improves cleanup efficiency while keeping the implementation clean and focused.
Thanks @sungming2 for the thorough analysis #2277 (comment). Great work here, thanks for the patience.
I'm aligned on taking the RAX-only approach forward. The trick that helps us avoid the CPU utilization spike is keeping the failure reports ordered via RAX; rounding off to the nearest second then helps with grouping and makes cleanup faster. The change will also be very limited in scope. @sarthakaggarwal97 / @madolson please share your thoughts.
The approach of using just RAX sounds good to me too. It is simple, doesn't require adding an additional custom data structure, and keeps the diff small. Also, rather than the nearest second (lower or upper), I would prefer the next nearest second (just upper), so that if there is a margin of error for expiry, we give more time and avoid an extra cycle to achieve a quorum.
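For concreteness, a tiny standalone illustration of that "round up to the next second" bucketing (the millisecond timestamps are made-up values; SEC_IN_MS mirrors the constant used in the patch):

```c
#include <stdio.h>

#define SEC_IN_MS 1000

int main(void) {
    /* Two reports landing within the same second share one bucket; the bucket
     * never precedes the raw report time, so expiry is only ever deferred. */
    long long reports[] = {1699999999123LL, 1699999999987LL, 1700000000001LL};
    for (int i = 0; i < 3; i++) {
        long long bucket = (reports[i] / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
        printf("report %lld ms -> bucket %lld ms\n", reports[i], bucket);
    }
    return 0;
}
```

The first two reports collapse into a single 1700000000000 bucket; bucketing only ever defers expiry, never advances it.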
Force-pushed from 5c7960a to f918977.
The new implementation looks so simple! Thanks for churning through different approaches @sungming2. Some minor comments from me.
Force-pushed from f918977 to 34cacc9.
Signed-off-by: Seungmin Lee <sungming@amazon.com>
@sungming2 please change the PR description to reflect the current implementation of the PR when you get a chance.
Mostly LGTM.
Could we update the failure-marking.tcl test to verify there is no failure report left at the end?
We could add the following to the test "Only primary with slots has the right to mark a node as failed":
# Check there are no failure reports left.
wait_for_condition 1000 50 {
[R 0 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
[R 2 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
[R 3 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
[R 4 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0
} else {
fail "Cluster COUNT-FAILURE-REPORTS is not right."
}
Signed-off-by: Seungmin Lee <sungming@amazon.com>
LGTM.
For others reviewing it: the failure report expiry is now rounded up to the next second (ceil). So cleanup will be delayed by at most a second, which seems reasonable to me. However, there is no delay in failover.
@madolson / @enjoy-binbin Would be nice if one of you could take a pass.
Mostly nitpicks and a comment about 32-bit decodes. I'll also throw on "run extra tests" so we get a 32-bit test run.
Signed-off-by: Seungmin Lee <sungming@amazon.com>
const size_t node_ptr_pad_bytes = (sizeof(clusterNode *) == 4) ? 4 : 0; // pad on 32-bit

/* Round up to the next second for fewer key splits and quorum grace */
mstime_t bucketed_time = (report_time / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
I am wondering if we can ignore the hour part and the tens of minutes in the timestamp (HH:MM:SS) to create fewer nodes in the radix tree. In some corner cases, say the report for a failed node is received at 12:59:59, the next report will have a lot of digits changed. If that sounds valid, I can create an issue for this as well.
wdyt @hpatro @sungming2?
Closes #2139
Summary
The original implementation used a simple list to track failure reports, which made core operations (update, delete, and cleanup) run in O(N) time. This became a serious bottleneck when many nodes failed simultaneously, as each new report or cleanup operation required scanning the entire list regardless of whether the reports were expired or still valid. This led to severe performance degradation under high failure scenarios due to repeated full list scans and inefficient lookups.
Key changes
This PR replaces the legacy fail_reports list with a new report implementation that uses a radix tree to manage failure reports more efficiently and robustly.
This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change; it's sufficient and easy to maintain.
To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the next second. This time bucketing keeps the rax structure compact.
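To make the description concrete, below is a minimal sketch of how such a rax-backed report store could work, assuming a key layout of big-endian bucketed expiry followed by the reporter's node pointer, zero-padded to 8 bytes on 32-bit builds (echoing the node_ptr_pad_bytes snippet in the review). Function names and the exact layout are illustrative assumptions, not the PR's code; the rax calls (raxInsert, raxSeek, raxNext, raxRemove, raxSize) are from the rax implementation already in the repository.

```c
/* Hypothetical sketch, not the PR's actual implementation. Keys are
 * [8 bytes big-endian bucketed expiry][8 bytes reporter pointer, zero-padded],
 * so byte-wise rax order matches expiry order and key length is the same on
 * 32-bit and 64-bit builds. */
#include <stdint.h>
#include <string.h>
#include "rax.h"

#define SEC_IN_MS 1000
#define REPORT_KEY_LEN 16

static size_t encodeReportKey(unsigned char *buf, uint64_t expiry_ms, void *reporter) {
    for (int i = 7; i >= 0; i--) {      /* big-endian: byte order == numeric order */
        buf[i] = expiry_ms & 0xff;
        expiry_ms >>= 8;
    }
    memset(buf + 8, 0, 8);              /* pad so 32-bit pointers still give 16-byte keys */
    memcpy(buf + 8, &reporter, sizeof(reporter));
    return REPORT_KEY_LEN;
}

/* Record a failure report, bucketing its expiry up to the next second. */
void failureReportAdd(rax *reports, uint64_t report_time_ms, void *reporter) {
    uint64_t bucketed = (report_time_ms / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
    unsigned char key[REPORT_KEY_LEN];
    size_t klen = encodeReportKey(key, bucketed, reporter);
    raxInsert(reports, key, klen, reporter, NULL);
}

/* Expire reports whose bucketed time is <= now. Keys are sorted, so we only
 * ever examine the smallest remaining key and stop as soon as it is valid. */
void failureReportCleanup(rax *reports, uint64_t now_ms) {
    while (raxSize(reports) > 0) {
        raxIterator it;
        raxStart(&it, reports);
        raxSeek(&it, "^", NULL, 0);                 /* position at the smallest key */
        if (!raxNext(&it)) { raxStop(&it); break; }
        uint64_t expiry = 0;
        for (int i = 0; i < 8; i++) expiry = (expiry << 8) | it.key[i];
        unsigned char key[REPORT_KEY_LEN];
        memcpy(key, it.key, sizeof(key));
        raxStop(&it);
        if (expiry > now_ms) break;                 /* everything left is still valid */
        raxRemove(reports, key, sizeof(key), NULL);
    }
}
```

Because keys sort by expiry first, cleanup only touches already-expired entries plus at most one still-valid one, instead of scanning the whole report list as before.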
Performance Test
The test was performed on an m7g.2xlarge instance with a 2000-node cluster (1000 primaries / 1000 replicas).
The original implementation shows 100% CPU utilization during a 300-node failover. In this case, clusterNodeCleanupFailureReports accounts for around 60% of the total CPU usage.
[CPU profile screenshots]
The current implementation shows ~30% CPU utilization during a 450-node failover (tested multiple times).