Skip to content

[NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load #2139

@sungming2

Description

@sungming2

The current implementation of clusterNodeCleanupFailureReports walks the entire fail reports in the list on every invocation and performs a listDelNode() for each expired or non‑voting entry. When a cluster node has accumulated hundreds or thousands of failure reports, this function becomes a CPU hotspot.

valkey/src/cluster_legacy.c

Lines 1702 to 1721 in 3ceae81

void clusterNodeCleanupFailureReports(clusterNode *node) {
list *l = node->fail_reports;
if (!listLength(l)) return;
listNode *ln;
listIter li;
clusterNodeFailReport *fr;
mstime_t maxtime = server.cluster_node_timeout * CLUSTER_FAIL_REPORT_VALIDITY_MULT;
mstime_t now = mstime();
listRewind(l, &li);
while ((ln = listNext(&li)) != NULL) {
fr = ln->value;
if (now - fr->time > maxtime) {
listDelNode(l, ln);
} else if (!clusterNodeIsVotingPrimary(fr->node)) {
listDelNode(l, ln);
}
}
}

Image
Image
(The above result was observed during a 2,000‑node cluster failover scenario)

Current behavior:

  • Where N is the number of failure reports, O(N) scan on every call, regardless of how many reports are actually expired.

Improvement ideas:

  • Use a priority queue with expiration time, so expired nodes can be removed in O(RlogN) rather than scanned in full. (R is # of actually expired reports)
  • Use lazy deletion instead of removing them on every call
  • Use early exit when meeting quorum

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions