Respect REPLCONF VERSION in full sync #2735

zuiderkwast · 2025-10-14T14:53:15Z

Provide an RDB snapshot that the replica can handle, for resilience during rolling upgrades.

This permits an older replica to do a full sync from a newer primary. The primary takes the replica's announced version into account when generating the snapshot. In particular, it allows latest Valkey to send snapshots as RDB 11 to replicas running Valkey 7.2 and 8.x.

The implementation is structurally similar to how filtered snapshots with REPLCONF RDB-FILTER-ONLY works (commits 1bf6d6f, 65a7635) and to the feature negotiation that replicas initiate using REPLCONF CAPA.

If any new features that the replica can't handle (such as hash-field expiration and atomic slot migration) are in use, the full sync is aborted and the replica connection is closed.

This mechanism will allow us to do RDB changes more often. In the recent years, we have been avoiding RDB changes. With this mechanism, there is no need to avoid introducing RDB changes such as new encodings and new compression algorithms (#1962).

In my experience, providing a way to undo an upgrade makes users less worried and actually more willing to upgrade. This is true not least when Valkey is a part of a larger system which is upgraded as a whole. Valkey may be just one microservice of many within a larger system; not uncommon in on-prem deployments. If anything goes wrong (even if it's not Valkey itself) the user wants a way to roll back the whole system to the last working state.

This speaker in this talk presents a similar experience:
https://www.youtube.com/watch?v=IIElfMbE8cw

In the talk, it's mentioned that rolling back a system can even be part of compliance and SLA requirements.

codecov · 2025-10-14T15:14:01Z

Codecov Report

❌ Patch coverage is 89.65517% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.41%. Comparing base (8182f4a) to head (ae6567d).
⚠️ Report is 25 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/rdb.c	87.50%	4 Missing ⚠️
src/cluster_migrateslots.c	50.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2735      +/-   ##
============================================
- Coverage     72.58%   72.41%   -0.18%     
============================================
  Files           128      128              
  Lines         71277    70165    -1112     
============================================
- Hits          51736    50807     -929     
+ Misses        19541    19358     -183

Files with missing lines	Coverage Δ
src/aof.c	`81.07% <100.00%> (-0.11%)`	⬇️
src/cluster.c	`90.22% <100.00%> (-0.03%)`	⬇️
src/rdb.h	`100.00% <ø> (ø)`
src/replication.c	`85.93% <100.00%> (-0.18%)`	⬇️
src/server.h	`100.00% <ø> (ø)`
src/cluster_migrateslots.c	`92.01% <50.00%> (-0.07%)`	⬇️
src/rdb.c	`76.69% <87.50%> (-0.43%)`	⬇️

... and 100 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

madolson · 2025-10-14T17:10:37Z

Where a command like HSETEX is replicated to replica that doesn't understand it, the replica logs a warning and ignores it (the default behavior with the config 'propagation-error-behavior ignore').

Given that this is a config for replica behavior, maybe we should also have a config here? AWS has the propagation-error-behavior ignore config set to panic. AWS (and I) think consistency is extremely important, since having an inconsistent cache can cause data corruption being sent to users, even on a replica. Throwing an error in the log really doesn't do all that much, since most people don't have alarming set up on it. If there is data from the newer version, I think there should be an option to abort it.

If we were to add a config rdb-unsupported-feautre-behavior, I would vote to have the default be stricter. It could also be made more generic.

This speaker in this talk presents a similar experience: https://www.youtube.com/watch?v=IIElfMbE8cw

AWS also quietly launched a downgrade feature from 7.2 -> 7.1, and users have mentioned the most important benefit is that they feel a lot more comfortable to do the upgrade.

The code looks mostly OK, but I didn't dive too deeply yet. I know this was contentious in the past, so it would be good to get @enjoy-binbin @soloestoy @PingXie @hwware to chime in as well. I think we probably need to just make a decision here since it keeps coming up as important.

zuiderkwast · 2025-10-14T20:23:19Z

If there is data from the newer version, I think there should be an option to abort it.

If we were to add a config rdb-unsupported-feautre-behavior, I would vote to have the default be stricter. It could also be made more generic.

Fine! Let's always abort in this case then? We don't need a config IMHO. What matters is that it works when no new features are used. If the user has started playing with HFE keys, they need to delete them and then it will work.

madolson · 2025-10-14T21:03:30Z

Fine! Let's always abort in this case then? We don't need a config IMHO. What matters is that it works when no new features are used. If the user has started playing with HFE keys, they need to delete them and then it will work.

Yeah, I'm OK with this too.

hwware · 2025-10-15T13:57:00Z

I think I will vote as abstention, because in our side, we have no this option to our users, and we have no idea which kind of surprise could happen. But I think this feature is useful for some other users, thus I vote "abstention"

Only in full sync. Use the replica's REPLCONF VERSION to select RDB version. Abort if any data that can't be represented in the older RDB version is found. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

enjoy-binbin

In my experience, providing a way to undo an upgrade makes users less worried and actually more willing to upgrade. This is true not least when Valkey is a part of a larger system which is upgraded as a whole. Valkey may be just one microservice of many within a larger system; not uncommon in on-prem deployments. If anything goes wrong (even if it's not Valkey itself) the user wants a way to roll back the whole system to the last working state.

I agree with this, and this is indeed an important one, i am in it. The code mostly LGTM but i did not dive too deeply, just comment that i am ok with this

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

github-actions bot assigned zuiderkwast Oct 14, 2025

Produce RDB 11 for replicas older than 9.0

d04cf78

Only in full sync. Use the replica's REPLCONF VERSION to select RDB version. Abort if any data that can't be represented in the older RDB version is found. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast force-pushed the downgrade branch from a5e0272 to d04cf78 Compare October 22, 2025 09:14

enjoy-binbin reviewed Oct 25, 2025

View reviewed changes

Minor changes

ae6567d

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast marked this pull request as ready for review October 27, 2025 13:35

zuiderkwast added this to Valkey 9.1 Oct 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Respect REPLCONF VERSION in full sync #2735

Respect REPLCONF VERSION in full sync #2735

zuiderkwast commented Oct 14, 2025 •

edited

Loading

Uh oh!

codecov bot commented Oct 14, 2025 •

edited

Loading

Uh oh!

madolson commented Oct 14, 2025

Uh oh!

zuiderkwast commented Oct 14, 2025

Uh oh!

madolson commented Oct 14, 2025

Uh oh!

hwware commented Oct 15, 2025

Uh oh!

enjoy-binbin left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Respect REPLCONF VERSION in full sync #2735

Are you sure you want to change the base?

Respect REPLCONF VERSION in full sync #2735

Conversation

zuiderkwast commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

madolson commented Oct 14, 2025

Uh oh!

zuiderkwast commented Oct 14, 2025

Uh oh!

madolson commented Oct 14, 2025

Uh oh!

hwware commented Oct 15, 2025

Uh oh!

enjoy-binbin left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zuiderkwast commented Oct 14, 2025 •

edited

Loading

codecov bot commented Oct 14, 2025 •

edited

Loading