
Optimize cluster failure report #2277


Merged: 27 commits merged into valkey-io:unstable on Jul 28, 2025

Conversation

@sungming2 (Contributor) commented Jun 26, 2025

Closes #2139

Summary

The original implementation used a simple list to track failure reports, which made core operations (update, delete, and cleanup) run in O(N) time. This became a serious bottleneck when many nodes failed simultaneously, as each new report or cleanup operation required scanning the entire list regardless of whether the reports were expired or still valid. This led to severe performance degradation under high failure scenarios due to repeated full list scans and inefficient lookups.
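For context, a paraphrased sketch of the kind of O(N) scan the list-based cleanup performs (an illustration of the pre-PR logic, not a verbatim copy of cluster_legacy.c; it assumes the in-tree adlist API and the clusterNodeFailReport struct):

/* Illustrative O(N) cleanup over the legacy fail_reports list: every call
 * walks the whole list, even when no report has expired. */
static void cleanupFailureReportsListSketch(list *fail_reports, mstime_t now, mstime_t maxtime) {
    listIter li;
    listNode *ln;

    listRewind(fail_reports, &li);
    while ((ln = listNext(&li)) != NULL) {
        clusterNodeFailReport *report = ln->value;
        /* Drop reports older than the validity window; the deletion itself is
         * O(1), but reaching each expired entry costs a full scan. */
        if (now - report->time > maxtime) listDelNode(fail_reports, ln);
    }
}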

Key changes

This PR replaces the legacy fail_reports list with a new implementation that uses a radix tree to manage failure reports more efficiently and robustly.
This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change. It is sufficient and easy to maintain.
To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the nearest second. This time bucketing keeps the rax structure compact.
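As a rough illustration of the new scheme, here is a minimal sketch of how a report could be keyed and inserted. The helper names and the exact 16-byte key layout (big-endian bucketed time followed by the sender node pointer) are assumptions based on the discussion later in this thread, not the merged code:

#include <stdint.h>
#include <string.h>
#include "rax.h"            /* in-tree radix tree */

#define SEC_IN_MS 1000
typedef long long mstime_t; /* matches the server-wide typedef */

/* Hypothetical 16-byte key: 8 bytes of big-endian bucketed report time
 * followed by the 8-byte sender node pointer. Big-endian encoding makes
 * the rax's lexicographic order match chronological order. Assumes 64-bit
 * pointers; a 32-bit build would need padding (see the 32-bit note below). */
static void failureReportEncodeKey(unsigned char key[16], mstime_t time, void *sender) {
    uint64_t t = (uint64_t)time;
    for (int i = 0; i < 8; i++) key[i] = (unsigned char)(t >> (8 * (7 - i)));
    memcpy(key + 8, &sender, sizeof(sender));
}

/* Round the report time up to the next whole second (time bucketing),
 * then insert the report into the per-node rax. O(L) with L = 16 bytes. */
static void failureReportAdd(rax *reports, mstime_t report_time, void *sender) {
    mstime_t bucketed_time = (report_time / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
    unsigned char key[16];
    failureReportEncodeKey(key, bucketed_time, sender);
    raxInsert(reports, key, sizeof(key), NULL, NULL);
}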

Performance Test

The test is performed on m7g.2xlarge with 2000 nodes cluster (1000 primaries/1000 replicas)

  • The original implementation shows 100% CPU utilization during a 300-node failover. In this case, clusterNodeCleanupFailureReports accounts for around 60% of the total CPU usage.

  • The current implementation shows ~30% CPU utilization during a 450-node failover. (Tested multiple times)


@sungming2 sungming2 changed the title Failure report Improve performance bottleneck from cluster failure report under heavy failure load Jun 26, 2025
@sungming2 sungming2 changed the title Improve performance bottleneck from cluster failure report under heavy failure load Optimize cluster failure report handling under heavy failure load Jun 26, 2025
codecov bot commented Jun 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.29%. Comparing base (c782b5a) to head (d62fc10).
⚠️ Report is 19 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2277      +/-   ##
============================================
- Coverage     71.41%   71.29%   -0.13%     
============================================
  Files           123      123              
  Lines         67092    67160      +68     
============================================
- Hits          47913    47879      -34     
- Misses        19179    19281     +102     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/cluster_legacy.c | 86.82% <100.00%> (+0.02%) ⬆️ |

... and 25 files with indirect coverage changes


@sungming2 sungming2 changed the title Optimize cluster failure report handling under heavy failure load Optimize cluster failure report Jun 26, 2025
@sungming2 sungming2 marked this pull request as ready for review June 26, 2025 09:49
@hpatro (Collaborator) left a comment

High level cursory thought:

If we plan to introduce a linked list + hashtable combination to reduce the time complexity, I would prefer it to be independent API(s) to consume. Currently, it looks quite easy to introduce bugs and difficult to unit test.

@sarthakaggarwal97 (Contributor) left a comment

Took an initial look at the approach. It's nice to see the improvement in time complexity, but I am unsure how to test it.

@hpatro (Collaborator) commented Jul 1, 2025

@sungming2 I was alluding to introducing a separate data structure file, say lru_cache.h/lru_cache.c, and using it from methods in the cluster file. Through that, we could add unit tests for the add/remove functionality of lru_cache.c/lru_cache.h.

The API(s) could be:

lruNew, lruPut, lruGet, lruDelete, lruFree
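For what it's worth, such a module's header might look like the sketch below; only the function names come from the comment above, the signatures are illustrative guesses, and this standalone module was ultimately not added:

#include <stddef.h>

/* lru_cache.h (hypothetical): a small standalone LRU cache so the
 * add/remove paths can be unit tested in isolation from the cluster code. */
typedef struct lruCache lruCache;

lruCache *lruNew(size_t capacity);                                        /* create a cache bounded to `capacity` entries */
int lruPut(lruCache *cache, const void *key, size_t keylen, void *value); /* insert or refresh an entry */
void *lruGet(lruCache *cache, const void *key, size_t keylen);            /* look up an entry, promoting it to most recent */
int lruDelete(lruCache *cache, const void *key, size_t keylen);           /* remove a single entry */
void lruFree(lruCache *cache);                                            /* release the cache and all of its entries */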

@hpatro (Collaborator) commented Jul 1, 2025

> @sungming2 I was alluding to introducing a separate data structure file, say lru_cache.h/lru_cache.c, and using it from methods in the cluster file. Through that, we could add unit tests for the add/remove functionality of lru_cache.c/lru_cache.h.
>
> The API(s) could be:
>
> lruNew, lruPut, lruGet, lruDelete, lruFree

@madolson / @sarthakaggarwal97 do you agree with this approach? I agree with @sungming2's approach about the usage of this data structure. I couldn't think of any other alternative to address the high CPU utilization.

@sarthakaggarwal97 (Contributor) commented

> @madolson / @sarthakaggarwal97 do you agree with this approach? I agree with @sungming2's approach about the usage of this data structure. I couldn't think of any other alternative to address the high CPU utilization.

Yeah, the data structures look fine to me. I had an idea where we could use the expiry timestamp as the key in a dictionary, with the values being the list of nodes, and evict based on the current time. But I haven't put a lot of thought into whether this can address the problem.

@sarthakaggarwal97 (Contributor) left a comment

Took a look at the APIs. I might have to think more about whether we can break something with these APIs, but I've dropped a few comments for you in the meantime. Thanks for following up on this.

Seungmin Lee added 8 commits July 10, 2025 03:17 (all signed off by Seungmin Lee <sungming@amazon.com>)
@hpatro (Collaborator) commented Jul 18, 2025

@sungming2 is currently evaluating whether vset would do the job. We will get back soon on this.

@sungming2 (Contributor, Author) commented Jul 22, 2025

There have been multiple solutions discussed for improving failure report handling. Each has different trade-offs in terms of complexity, maintainability, and performance:

  1. List (original failure report)
    A simple linked list of reports, but operations like refresh, removal, and expiration all require full scans (O(N)), making it inefficient at scale.

  2. Dict + List
    Adds a dictionary for O(1) lookup, update, and removal of failure reports. The list still holds reports in insertion order, so sorting is done during insertions. While it improves access performance, maintaining both structures adds code complexity.

  3. Dict + RAX
    Similar to (dict+list), but uses a radix tree to maintain sorted order automatically. This reduces code for ordering and makes expiration scans more efficient. Still requires managing two structures.

  4. Dict + VSET (vset link)
    Similar to (dict+rax), but vset uses a hybrid structure that internally switches between vector, rax, and hash representations based on the number of entries, with a time-bucket mechanism. While promising in theory, the API is restrictive and currently optimized for hash field expirations. It may not generalize well to our case without additional API work, and it still needs two structures.

  5. RAX only
    This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change. It is sufficient and easy to maintain.
    To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the nearest second. This time bucketing keeps the rax structure compact.

In short, I think the rax-only approach is a minimal, targeted fix that improves cleanup efficiency while keeping the implementation clean and focused. The table below compares the options:

| Operation | list (original) | rax | dict + list | dict + radix |
| --- | --- | --- | --- | --- |
| Add new report | O(1) (add tail) | O(L)=O(1) (raxInsert) | O(1) (dict lookup) + O(N) (scan list backward, worst case) + O(1) (list insert/delete) + O(1) (dictReplace). Overall: O(N) worst case, O(1) monotonic case | O(1) (dict lookup) + O(L) (raxInsert) + O(1) (dictAdd). Overall: O(L)=O(1) |
| Refresh existing report | O(N) (full-scan update) | O(N) (full-scan update) | O(1) (dict lookup) + O(1) (list insert/delete) + O(1) (dictReplace). Overall: O(1) | O(1) (dict lookup) + O(L) (raxRemove) + O(L) (raxInsert) + O(1) (dictReplace). Overall: O(L)=O(1) |
| Remove a report early | O(N) (full-scan remove) | O(N) (full-scan remove) | O(1) (dict lookup) + O(1) (listDelNode) + O(1) (dictDelete). Overall: O(1) | O(1) (dict lookup) + O(L) (raxRemove) + O(1) (dictDelete). Overall: O(L)=O(1) |
| Count reports | O(N) (expire reports) + O(1) (listSize). Overall: O(N) | O(E) (expire reports) + O(1) (rax size). Overall: O(E) | O(E) (expire reports) + O(1) (list length). Overall: O(E) | O(L*E)=O(E) (expire reports) + O(1) (dictSize). Overall: O(E) |
| Expire old reports | O(N) (full-scan expiration) | O(E) (scan/remove E expired items). Overall: O(E) | O(E) (scan/remove E expired items). Overall: O(E) | O(L) per expired entry (next+remove E items). Overall: O(L*E)=O(E) |
| CPU utilization (450-node failover) | 100% | 35% | 30% | 32% |

  • L = length of the radix key = 16 bytes (8-byte expiry + 8-byte node pointer); E = number of expired reports scanned/removed.
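To make the O(E) expiration column concrete, here is a hedged sketch of how the cleanup scan can work once reports are keyed by big-endian bucketed time (continuing the hypothetical key layout from the sketch under "Key changes"; the merged clusterNodeCleanupFailureReports() may differ in detail):

/* Remove every report whose bucketed time is older than the validity
 * window. Only the E expired entries at the head of the sorted rax are
 * touched; the first still-valid entry stops the scan. */
static void failureReportsExpireSketch(rax *reports, mstime_t now, mstime_t validity_ms) {
    raxIterator it;
    raxStart(&it, reports);
    raxSeek(&it, "^", NULL, 0);                          /* position at the smallest key */
    while (raxNext(&it)) {
        uint64_t bucket = 0;
        for (int i = 0; i < 8; i++) bucket = (bucket << 8) | it.key[i];
        if ((mstime_t)bucket > now - validity_ms) break; /* the rest are newer: done */

        unsigned char key[16];
        memcpy(key, it.key, sizeof(key));
        raxRemove(reports, key, sizeof(key), NULL);      /* O(L) per expired entry */
        raxSeek(&it, ">", key, sizeof(key));             /* removal invalidates the iterator; re-seek */
    }
    raxStop(&it);
}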

@hpatro (Collaborator) commented Jul 22, 2025

Thanks @sungming2 for the thorough analysis #2277 (comment). Great work here, thanks for the patience.

> RAX only
> This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change. It is sufficient and easy to maintain.
> To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the nearest second. This time bucketing keeps the rax structure compact.

I'm aligned on taking the RAX-only approach forward. The trick that helped us avoid the CPU utilization spike is keeping the failure reports ordered via RAX and rounding timestamps to the nearest second, which groups entries and makes cleanup faster. The change will also be very limited in scope.

@sarthakaggarwal97 / @madolson Please share your thoughts.

@sarthakaggarwal97 (Contributor) commented Jul 22, 2025

The approach of using just RAX sounds good to me too. It is simple, doesn't require adding an additional custom data structure, and keeps the diff small. Also, rather than the nearest second (lower or upper), I would prefer the next second (just upper), so that if there is a margin of error for expiry, we give more time rather than risk an extra cycle to achieve a quorum.

@sarthakaggarwal97 (Contributor) left a comment

The new implementation looks so simple! Thanks for churning through different approaches @sungming2. Some minor comments from me.

Two commits added, signed off by Seungmin Lee <sungming@amazon.com>
@sarthakaggarwal97 (Contributor) commented

@sungming2 please change the PR description to reflect the current implementation of the PR when you get a chance.

@hpatro (Collaborator) left a comment

Mostly LGTM.

Could we update the failure-marking.tcl test to verify there is no failure report left at the end?

We could add the following to the test "Only primary with slots has the right to mark a node as failed":

# Check there are no failure reports left.
wait_for_condition 1000 50 {
    [R 0 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
    [R 2 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
    [R 3 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
    [R 4 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0
} else {
    fail "Cluster COUNT-FAILURE-REPORTS is not right."
}

Seungmin Lee added 2 commits July 24, 2025 17:56 (both signed off by Seungmin Lee <sungming@amazon.com>)
@hpatro (Collaborator) left a comment

LGTM.

For others reviewing this: the failure report time is now rounded up to the nearest second (ceil). So cleanup will be delayed by at most one second, which seems reasonable to me. However, there is no delay in failover.

@hpatro (Collaborator) commented Jul 25, 2025

@madolson / @enjoy-binbin Would be nice if one of you could take a pass.

@madolson (Member) left a comment

Mostly nitpicks and a comment about 32-bit decodes. I'll also throw on the run-extra-tests label so we get a 32-bit test run.

@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 25, 2025
Signed-off-by: Seungmin Lee <sungming@amazon.com>
@hpatro hpatro merged commit 2a44506 into valkey-io:unstable Jul 28, 2025
98 of 102 checks passed
const size_t node_ptr_pad_bytes = (sizeof(clusterNode *) == 4) ? 4 : 0; // pad on 32-bit

/* Round up to the next second for fewer key splits and quorum grace */
mstime_t bucketed_time = (report_time / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
A contributor commented on the diff above:

I am wondering if we can ignore the hour part and the tens of minutes in the timestamp (HH:MM:SS) to create fewer nodes in the radix tree. In some corner cases, say the report for a failed node is received at 12:59:59, the next report will have a lot of digits changed. If this sounds valid, I can create an issue for it as well.

wdyt @hpatro @sungming2?

Labels
cluster, run-extra-tests (Runs all tests from daily except valgrind and RESP)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load
5 participants