
Optimize cluster failure report #2277


Merged: 27 commits merged into valkey-io:unstable on Jul 28, 2025

Conversation

@sungming2 (Contributor) commented Jun 26, 2025

Closes #2139

Summary

The original implementation used a simple list to track failure reports, which made core operations (update, delete, and cleanup) run in O(N) time. This became a serious bottleneck when many nodes failed simultaneously, as each new report or cleanup operation required scanning the entire list regardless of whether the reports were expired or still valid. This led to severe performance degradation under high failure scenarios due to repeated full list scans and inefficient lookups.
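For context, a paraphrased sketch of the kind of O(N) scan the list-based cleanup performs (an illustration of the pre-PR logic, not a verbatim copy of cluster_legacy.c; it assumes the in-tree adlist API and the clusterNodeFailReport struct):

/* Illustrative O(N) cleanup over the legacy fail_reports list: every call
 * walks the whole list, even when no report has expired. */
static void cleanupFailureReportsListSketch(list *fail_reports, mstime_t now, mstime_t maxtime) {
    listIter li;
    listNode *ln;

    listRewind(fail_reports, &li);
    while ((ln = listNext(&li)) != NULL) {
        clusterNodeFailReport *report = ln->value;
        /* Drop reports older than the validity window; the deletion itself is
         * O(1), but reaching each expired entry costs a full scan. */
        if (now - report->time > maxtime) listDelNode(fail_reports, ln);
    }
}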

Key changes

This PR replaces the legacy fail_reports list with a new implementation that uses a radix tree to manage failure reports more efficiently and robustly.
This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change. It is sufficient and easy to maintain.
To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the nearest second. This time bucketing keeps the rax structure compact.
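As a rough illustration of the new scheme, here is a minimal sketch of how a report could be keyed and inserted. The helper names and the exact 16-byte key layout (big-endian bucketed time followed by the sender node pointer) are assumptions based on the discussion later in this thread, not the merged code:

#include <stdint.h>
#include <string.h>
#include "rax.h"            /* in-tree radix tree */

#define SEC_IN_MS 1000
typedef long long mstime_t; /* matches the server-wide typedef */

/* Hypothetical 16-byte key: 8 bytes of big-endian bucketed report time
 * followed by the 8-byte sender node pointer. Big-endian encoding makes
 * the rax's lexicographic order match chronological order. Assumes 64-bit
 * pointers; a 32-bit build would need padding (see the 32-bit note below). */
static void failureReportEncodeKey(unsigned char key[16], mstime_t time, void *sender) {
    uint64_t t = (uint64_t)time;
    for (int i = 0; i < 8; i++) key[i] = (unsigned char)(t >> (8 * (7 - i)));
    memcpy(key + 8, &sender, sizeof(sender));
}

/* Round the report time up to the next whole second (time bucketing),
 * then insert the report into the per-node rax. O(L) with L = 16 bytes. */
static void failureReportAdd(rax *reports, mstime_t report_time, void *sender) {
    mstime_t bucketed_time = (report_time / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
    unsigned char key[16];
    failureReportEncodeKey(key, bucketed_time, sender);
    raxInsert(reports, key, sizeof(key), NULL, NULL);
}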

Performance Test

The test is performed on m7g.2xlarge with 2000 nodes cluster (1000 primaries/1000 replicas)

  • The original implementation shows 100% CPU utilization during a 300-node failover. In this case, clusterNodeCleanupFailureReports accounts for around 60% of the total CPU usage.

  • The current implementation shows ~30% CPU utilization during a 450-node failover. (Tested multiple times)


@sungming2 sungming2 changed the title Failure report Improve performance bottleneck from cluster failure report under heavy failure load Jun 26, 2025
@sungming2 sungming2 changed the title Improve performance bottleneck from cluster failure report under heavy failure load Optimize cluster failure report handling under heavy failure load Jun 26, 2025
codecov bot commented Jun 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.29%. Comparing base (c782b5a) to head (d62fc10).
⚠️ Report is 19 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2277      +/-   ##
============================================
- Coverage     71.41%   71.29%   -0.13%     
============================================
  Files           123      123              
  Lines         67092    67160      +68     
============================================
- Hits          47913    47879      -34     
- Misses        19179    19281     +102     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/cluster_legacy.c | 86.82% <100.00%> (+0.02%) ⬆️ |

... and 25 files with indirect coverage changes


@sungming2 sungming2 changed the title Optimize cluster failure report handling under heavy failure load Optimize cluster failure report Jun 26, 2025
@sungming2 sungming2 marked this pull request as ready for review June 26, 2025 09:49
@hpatro (Collaborator) left a comment

High level cursory thought:

If we plan to introduce a linked list + hashtable combination to reduce the time complexity, I would prefer it to be independent API(s) to consume. Currently, it looks quite easy to introduce bugs and difficult to unit test.

@sarthakaggarwal97 (Contributor) left a comment

Took an initial look at the approach. It's nice to see the improvement in time complexity, but I am unsure how to test it.

@hpatro (Collaborator) commented Jul 1, 2025

@sungming2 I was alluding to introducing a separate data structure file, say lru_cache.h/lru_cache.c, and using it from methods in the cluster file. Through that, we could add unit tests for the add/remove functionality of lru_cache.c/lru_cache.h.

The API(s) could be:

lruNew, lruPut, lruGet, lruDelete, lruFree
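For what it's worth, such a module's header might look like the sketch below; only the function names come from the comment above, the signatures are illustrative guesses, and this standalone module was ultimately not added:

#include <stddef.h>

/* lru_cache.h (hypothetical): a small standalone LRU cache so the
 * add/remove paths can be unit tested in isolation from the cluster code. */
typedef struct lruCache lruCache;

lruCache *lruNew(size_t capacity);                                        /* create a cache bounded to `capacity` entries */
int lruPut(lruCache *cache, const void *key, size_t keylen, void *value); /* insert or refresh an entry */
void *lruGet(lruCache *cache, const void *key, size_t keylen);            /* look up an entry, promoting it to most recent */
int lruDelete(lruCache *cache, const void *key, size_t keylen);           /* remove a single entry */
void lruFree(lruCache *cache);                                            /* release the cache and all of its entries */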

@hpatro (Collaborator) commented Jul 1, 2025

> @sungming2 I was alluding to introducing a separate data structure file, say lru_cache.h/lru_cache.c, and using it from methods in the cluster file. Through that, we could add unit tests for the add/remove functionality of lru_cache.c/lru_cache.h.
>
> The API(s) could be:
>
> lruNew, lruPut, lruGet, lruDelete, lruFree

@madolson / @sarthakaggarwal97 do you agree with this approach? I agree with @sungming2's approach about the usage of this data structure. I couldn't think of any other alternative to address the high CPU utilization.

@sarthakaggarwal97 (Contributor) commented

> @madolson / @sarthakaggarwal97 do you agree with this approach? I agree with @sungming2's approach about the usage of this data structure. I couldn't think of any other alternative to address the high CPU utilization.

Yeah, the data structures look fine to me. I had an idea where we could use the expiry timestamp as the key in a dictionary, with the values being the list of nodes, and evict based on the current time. But I haven't put a lot of thought into whether this can address the problem.

@sarthakaggarwal97 (Contributor) left a comment

Took a look at the APIs. I might have to think more about whether we can break something with these APIs, but I've dropped a few comments for you in the meantime. Thanks for following up on this.

Seungmin Lee added 8 commits July 10, 2025 03:17 (all signed off by Seungmin Lee <sungming@amazon.com>)
@hpatro (Collaborator) commented Jul 18, 2025

@sungming2 is currently evaluating whether vset would do the job. We will get back soon on this.

@sungming2 (Contributor, Author) commented Jul 22, 2025

There have been multiple solutions discussed for improving failure report handling. Each has different trade-offs in terms of complexity, maintainability, and performance:

  1. List (original failure report)
    A simple linked list of reports, but operations like refresh, removal, and expiration all require full scans (O(N)), making it inefficient at scale.

  2. Dict + List
    Adds a dictionary for O(1) lookup, update, and removal of failure reports. The list still holds reports in insertion order, so sorting is done during insertions. While it improves access performance, maintaining both structures adds code complexity.

  3. Dict + RAX
    Similar to (dict+list), but uses a radix tree to maintain sorted order automatically. This reduces code for ordering and makes expiration scans more efficient. Still requires managing two structures.

  4. Dict + VSET (vset link)
    Similar to (dict+rax), but vset uses a hybrid structure that internally switches between vector, rax, and hash representations based on the number of entries, with a time-bucket mechanism. While promising in theory, the API is restrictive and currently optimized for hash field expirations. It may not generalize well to our case without additional API work, and it still needs two structures.

  5. RAX only
    This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change. It is sufficient and easy to maintain.
    To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the nearest second. This time bucketing keeps the rax structure compact.

In short, I think the rax-only approach is a minimal, targeted fix that improves cleanup efficiency while keeping the implementation clean and focused. The table below compares the options:

| Operation | list (original) | rax | dict + list | dict + radix |
| --- | --- | --- | --- | --- |
| Add new report | O(1) (add tail) | O(L)=O(1) (raxInsert) | O(1) (dict lookup) + O(N) (scan list backward, worst case) + O(1) (list insert/delete) + O(1) (dictReplace). Overall: O(N) worst case, O(1) monotonic case | O(1) (dict lookup) + O(L) (raxInsert) + O(1) (dictAdd). Overall: O(L)=O(1) |
| Refresh existing report | O(N) (full-scan update) | O(N) (full-scan update) | O(1) (dict lookup) + O(1) (list insert/delete) + O(1) (dictReplace). Overall: O(1) | O(1) (dict lookup) + O(L) (raxRemove) + O(L) (raxInsert) + O(1) (dictReplace). Overall: O(L)=O(1) |
| Remove a report early | O(N) (full-scan remove) | O(N) (full-scan remove) | O(1) (dict lookup) + O(1) (listDelNode) + O(1) (dictDelete). Overall: O(1) | O(1) (dict lookup) + O(L) (raxRemove) + O(1) (dictDelete). Overall: O(L)=O(1) |
| Count reports | O(N) (expire reports) + O(1) (listSize). Overall: O(N) | O(E) (expire reports) + O(1) (rax size). Overall: O(E) | O(E) (expire reports) + O(1) (list length). Overall: O(E) | O(L*E)=O(E) (expire reports) + O(1) (dictSize). Overall: O(E) |
| Expire old reports | O(N) (full-scan expiration) | O(E) (scan/remove E expired items). Overall: O(E) | O(E) (scan/remove E expired items). Overall: O(E) | O(L) per expired entry (next+remove E items). Overall: O(L*E)=O(E) |
| CPU utilization (450-node failover) | 100% | 35% | 30% | 32% |

  • L = length of the radix key = 16 bytes (8-byte expiry + 8-byte node pointer); E = number of expired reports scanned/removed.
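To make the O(E) expiration column concrete, here is a hedged sketch of how the cleanup scan can work once reports are keyed by big-endian bucketed time (continuing the hypothetical key layout from the sketch under "Key changes"; the merged clusterNodeCleanupFailureReports() may differ in detail):

/* Remove every report whose bucketed time is older than the validity
 * window. Only the E expired entries at the head of the sorted rax are
 * touched; the first still-valid entry stops the scan. */
static void failureReportsExpireSketch(rax *reports, mstime_t now, mstime_t validity_ms) {
    raxIterator it;
    raxStart(&it, reports);
    raxSeek(&it, "^", NULL, 0);                          /* position at the smallest key */
    while (raxNext(&it)) {
        uint64_t bucket = 0;
        for (int i = 0; i < 8; i++) bucket = (bucket << 8) | it.key[i];
        if ((mstime_t)bucket > now - validity_ms) break; /* the rest are newer: done */

        unsigned char key[16];
        memcpy(key, it.key, sizeof(key));
        raxRemove(reports, key, sizeof(key), NULL);      /* O(L) per expired entry */
        raxSeek(&it, ">", key, sizeof(key));             /* removal invalidates the iterator; re-seek */
    }
    raxStop(&it);
}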

@hpatro (Collaborator) commented Jul 22, 2025

Thanks @sungming2 for the thorough analysis #2277 (comment). Great work here, thanks for the patience.

> RAX only
> This is the simplest and most targeted fix: we keep just the radix tree to maintain sorted reports. This avoids adding a new composite structure or changing the core failure report API. It directly addresses the main bottleneck, clusterNodeCleanupFailureReports(), by enabling fast expiration with minimal code change. It is sufficient and easy to maintain.
> To reduce memory overhead and excessive node splits caused by millisecond-level keys, we round expiry timestamps up to the nearest second. This time bucketing keeps the rax structure compact.

I'm aligned on taking the RAX-only approach forward. The trick that helped us avoid the CPU utilization spike is keeping the failure reports ordered via RAX and rounding timestamps to the nearest second, which groups entries and makes cleanup faster. The change will also be very limited in scope.

@sarthakaggarwal97 / @madolson Please share your thoughts.

@sarthakaggarwal97 (Contributor) commented Jul 22, 2025

The approach of using just RAX sounds good to me too. It is simple, doesn't require adding an additional custom data structure, and keeps the diff small. Also, rather than the nearest second (lower or upper), I would prefer the next second (just upper), so that if there is a margin of error for expiry, we give more time rather than risk an extra cycle to achieve a quorum.

@sarthakaggarwal97 (Contributor) left a comment

The new implementation looks so simple! Thanks for churning through different approaches @sungming2. Some minor comments from me.

Two commits added, signed off by Seungmin Lee <sungming@amazon.com>
@sarthakaggarwal97 (Contributor) commented

@sungming2 please change the PR description to reflect the current implementation of the PR when you get a chance.

@hpatro (Collaborator) left a comment

Mostly LGTM.

Could we update the failure-marking.tcl test to verify there is no failure report left at the end?

We could add the following to the test "Only primary with slots has the right to mark a node as failed":

# Check there are no failure reports left.
wait_for_condition 1000 50 {
    [R 0 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
    [R 2 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
    [R 3 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0 &&
    [R 4 CLUSTER COUNT-FAILURE-REPORTS $replica_id] == 0
} else {
    fail "Cluster COUNT-FAILURE-REPORTS is not right."
}

Seungmin Lee added 2 commits July 24, 2025 17:56 (both signed off by Seungmin Lee <sungming@amazon.com>)
@hpatro (Collaborator) left a comment

LGTM.

For others reviewing this: the failure report time is now rounded up to the nearest second (ceil). So cleanup will be delayed by at most one second, which seems reasonable to me. However, there is no delay in failover.

@hpatro (Collaborator) commented Jul 25, 2025

@madolson / @enjoy-binbin Would be nice if one of you could take a pass.

@madolson (Member) left a comment

Mostly nitpicks and a comment about 32-bit decodes. I'll also throw on the run-extra-tests label so we get a 32-bit test run.

@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 25, 2025
Signed-off-by: Seungmin Lee <sungming@amazon.com>
@hpatro hpatro merged commit 2a44506 into valkey-io:unstable Jul 28, 2025
98 of 102 checks passed
const size_t node_ptr_pad_bytes = (sizeof(clusterNode *) == 4) ? 4 : 0; // pad on 32-bit

/* Round up to the next second for fewer key splits and quorum grace */
mstime_t bucketed_time = (report_time / SEC_IN_MS) * SEC_IN_MS + SEC_IN_MS;
A contributor commented on the diff above:

I am wondering if we can ignore the hour part and the tens of minutes in the timestamp (HH:MM:SS) to create fewer nodes in the radix tree. In some corner cases, say the report for a failed node is received at 12:59:59, the next report will have a lot of digits changed. If this sounds valid, I can create an issue for it as well.

wdyt @hpatro @sungming2?

Labels
cluster, run-extra-tests (Runs all tests from daily except valgrind and RESP)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load
5 participants