
stop master on AOF short write if there are enough good replicas #2375


Open · wants to merge 4 commits into unstable from f/stop_master

Conversation

@kronwerk (Contributor) commented Jul 23, 2025

When the primary's disk fills up during AOF writes, it can end up stalled in that state forever. This PR adds a config option to stop the primary on an AOF short write when it has enough good replicas, reusing the recently added failover attempt from the cluster version.

Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>

codecov bot commented Jul 24, 2025

Codecov Report

Attention: Patch coverage is 47.36842% with 10 lines in your changes missing coverage. Please review.

Project coverage is 71.41%. Comparing base (a739531) to head (e431225).

Files with missing lines | Patch % | Lines
src/aof.c  | 0.00%  | 7 Missing ⚠️
src/eval.c | 50.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           unstable    #2375   +/-   ##
=========================================
  Coverage     71.41%   71.41%           
=========================================
  Files           123      123           
  Lines         67139    67153   +14     
=========================================
+ Hits          47947    47959   +12     
- Misses        19192    19194    +2     
Files with missing lines | Coverage Δ
src/config.c          | 78.47% <ø> (ø)
src/replication.c     | 86.73% <100.00%> (-0.07%) ⬇️
src/server.h          | 100.00% <ø> (ø)
src/valkey-benchmark.c | 61.49% <ø> (+0.21%) ⬆️
src/eval.c            | 87.43% <50.00%> (-0.68%) ⬇️
src/aof.c             | 80.12% <0.00%> (-0.41%) ⬇️

... and 13 files with indirect coverage changes


Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
@kronwerk kronwerk force-pushed the f/stop_master branch 3 times, most recently from 223d7a7 to 6489e52 Compare July 24, 2025 15:12
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
@kronwerk kronwerk marked this pull request as ready for review July 24, 2025 17:25
@murphyjacob4 (Contributor) left a comment


Thanks for the PR! I think there is definitely a gap here - thanks for identifying it and putting together a solution. I have a few comments about the approach, please take a look!

listIter li;
listNode *ln;
int good = 0;

if (!server.repl_min_replicas_to_write || !server.repl_min_replicas_max_lag) return;

listRewind(server.replicas, &li);
while ((ln = listNext(&li))) {
client *replica = ln->value;
time_t lag = server.unixtime - replica->repl_data->repl_ack_time;
Lag here means "how long (in seconds) since the last REPLCONF ACK response".

If we are testing lag==0, that doesn't necessarily guarantee that the replica is fully caught up. It just means that we have gotten a REPLCONF ACK on the current second (this could have been up to 999ms ago in the worst case).

Should we compare the replication offset instead?

@kronwerk (Contributor, Author) commented Jul 25, 2025

@murphyjacob4 hmmm, this check was introduced 12 years ago, so maybe it could be improved.
[screenshot of the original code]

I'm not sure about the right way to do it, though. In my opinion, repl_min_replicas_max_lag is a way to allow considering a replica "good" even if there is a user-configured amount of lag.

If we want to switch to offsets, what should we do? Ignore repl_min_replicas_max_lag, deprecate it, and introduce repl_min_replicas_max_offset? Would it be convenient and clear for a user which offset is enough for the same use case? Or should repl_min_replicas_max_offset always be 0 (equal offsets for primary and replica)? In that case we lose the ability to consider a lagged replica "good"; is that acceptable for us?

Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>