[NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load #2139

The current implementation of `clusterNodeCleanupFailureReports` walks the entire list of failure reports on every invocation and calls `listDelNode()` for each expired or non-voting entry. When a cluster node has accumulated hundreds or thousands of failure reports, this function becomes a CPU hotspot.

https://github.com/valkey-io/valkey/blob/3ceae81fc4cf065dae00888ab92b7aa069fde111/src/cluster_legacy.c#L1702-L1721

![Image](https://github.com/user-attachments/assets/4d4c8117-8613-4f99-860f-08d60d9f78e1)
![Image](https://github.com/user-attachments/assets/f544c279-ca8f-4134-946c-bcc6aec7f812)
(The results above were observed during a failover scenario in a 2,000-node cluster.)


**Current behavior:**
- O(N) scan on every call, where N is the number of failure reports, regardless of how many reports have actually expired.

**Improvement ideas:**
- ~~Use a priority queue keyed by expiration time, so expired entries can be removed in O(R log N) rather than by a full scan (R is the number of reports that have actually expired).~~
- Use lazy deletion instead of removing expired entries on every call.
- Exit early once quorum is met.
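
To make the early-exit idea concrete, here is a minimal, self-contained sketch. It does not use the real Valkey types: `report`, `from_voter`, and `count_valid_reports` are hypothetical stand-ins, and the list is a plain singly linked list rather than `adlist`. It assumes the caller only needs to know whether the number of valid reports reaches the quorum, so the scan can stop as soon as that many have been seen.

```c
#include <stddef.h>

/* Simplified stand-ins for the cluster types; every name here is hypothetical. */
typedef long long mstime_t;

typedef struct report {
    mstime_t time;       /* when the report was last refreshed */
    int from_voter;      /* 1 if the reporter is a voting primary */
    struct report *next;
} report;

/* Count reports that are still valid, stopping once `quorum` is reached.
 * Nothing is unlinked: stale or non-voting entries are only skipped.
 * If `visited` is non-NULL it receives the number of entries examined. */
static int count_valid_reports(const report *head, mstime_t now,
                               mstime_t maxtime, int quorum, size_t *visited) {
    int valid = 0;
    size_t seen = 0;
    for (const report *r = head; r != NULL; r = r->next) {
        seen++;
        if (now - r->time > maxtime) continue; /* expired */
        if (!r->from_voter) continue;          /* reporter cannot vote */
        if (++valid >= quorum) break;          /* quorum met: stop scanning */
    }
    if (visited) *visited = seen;
    return valid;
}
```

The return value is capped at `quorum`, so this only fits callers that compare the count against the quorum. In the worst case (quorum never reached) it still visits all N entries.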





	void clusterNodeCleanupFailureReports(clusterNode *node) {
	    list *l = node->fail_reports;
	    if (!listLength(l)) return;

	    listNode *ln;
	    listIter li;
	    clusterNodeFailReport *fr;
	    mstime_t maxtime = server.cluster_node_timeout * CLUSTER_FAIL_REPORT_VALIDITY_MULT;
	    mstime_t now = mstime();

	    listRewind(l, &li);
	    while ((ln = listNext(&li)) != NULL) {
	        fr = ln->value;
	        if (now - fr->time > maxtime) {
	            listDelNode(l, ln);
	        } else if (!clusterNodeIsVotingPrimary(fr->node)) {
	            listDelNode(l, ln);
	        }
	    }
	}
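
One possible reading of the lazy-deletion idea, applied to a function of this shape, is to throttle the sweep so that the full scan runs at most once per interval instead of on every call. The sketch below is a standalone model, not a patch: `node_reports`, `last_cleanup`, `add_report`, and `cleanup_if_due` are hypothetical names, it uses a plain singly linked list instead of `adlist`, and it models only the expiry check (the voting-primary check is left out for brevity).

```c
#include <stdlib.h>

typedef long long mstime_t;

typedef struct report {
    mstime_t time;          /* when the report was last refreshed */
    struct report *next;
} report;

typedef struct node_reports {
    report *head;
    size_t len;
    mstime_t last_cleanup;  /* hypothetical field: time of the previous sweep */
} node_reports;

/* Push a new report at the head of the list. */
static void add_report(node_reports *nr, mstime_t time) {
    report *r = malloc(sizeof(*r));
    if (r == NULL) abort();
    r->time = time;
    r->next = nr->head;
    nr->head = r;
    nr->len++;
}

/* Unlink and free expired reports, but only if at least `interval` ms have
 * passed since the previous sweep. Returns the number of reports removed. */
static size_t cleanup_if_due(node_reports *nr, mstime_t now,
                             mstime_t maxtime, mstime_t interval) {
    if (nr->len == 0 || now - nr->last_cleanup < interval) return 0;
    nr->last_cleanup = now;

    size_t removed = 0;
    report **link = &nr->head;   /* pointer to the link being examined */
    while (*link != NULL) {
        report *r = *link;
        if (now - r->time > maxtime) {
            *link = r->next;     /* unlink the expired entry */
            free(r);
            nr->len--;
            removed++;
        } else {
            link = &r->next;
        }
    }
    return removed;
}
```

The trade-off is that the list can hold stale entries between sweeps, so any code that counts reports would have to skip expired entries itself rather than rely on the list length.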
