Skip to content

[NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load #2139

@sungming2

Description

@sungming2

The current implementation of clusterNodeCleanupFailureReports walks the entire fail reports in the list on every invocation and performs a listDelNode() for each expired or non‑voting entry. When a cluster node has accumulated hundreds or thousands of failure reports, this function becomes a CPU hotspot.

valkey/src/cluster_legacy.c

Lines 1702 to 1721 in 3ceae81

void clusterNodeCleanupFailureReports(clusterNode *node) {
list *l = node->fail_reports;
if (!listLength(l)) return;
listNode *ln;
listIter li;
clusterNodeFailReport *fr;
mstime_t maxtime = server.cluster_node_timeout * CLUSTER_FAIL_REPORT_VALIDITY_MULT;
mstime_t now = mstime();
listRewind(l, &li);
while ((ln = listNext(&li)) != NULL) {
fr = ln->value;
if (now - fr->time > maxtime) {
listDelNode(l, ln);
} else if (!clusterNodeIsVotingPrimary(fr->node)) {
listDelNode(l, ln);
}
}
}

Image
Image
(The above result was observed during a 2,000‑node cluster failover scenario)

Current behavior:

  • Where N is the number of failure reports, O(N) scan on every call, regardless of how many reports are actually expired.

Improvement ideas:

  • Use a priority queue with expiration time, so expired nodes can be removed in O(RlogN) rather than scanned in full. (R is # of actually expired reports)
  • Use lazy deletion instead of removing them on every call
  • Use early exit when meeting quorum

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

Status

Optional for next release candidate

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions