Skip to content

Conversation

@zuiderkwast
Copy link
Contributor

@zuiderkwast zuiderkwast commented Oct 14, 2025

Provide an RDB snapshot that the replica can handle, for resilience during rolling upgrades.

This permits an older replica to do a full sync from a newer primary. The primary takes the replica's announced version into account when generating the snapshot. In particular, it allows latest Valkey to send snapshots as RDB 11 to replicas running Valkey 7.2 and 8.x.

The implementation is structurally similar to how filtered snapshots with REPLCONF RDB-FILTER-ONLY works (commits 1bf6d6f, 65a7635) and to the feature negotiation that replicas initiate using REPLCONF CAPA.

If any new features that the replica can't handle (such as hash-field expiration and atomic slot migration) are in use, the full sync is aborted and the replica connection is closed.

This mechanism will allow us to do RDB changes more often. In the recent years, we have been avoiding RDB changes. With this mechanism, there is no need to avoid introducing RDB changes such as new encodings and new compression algorithms (#1962).

In my experience, providing a way to undo an upgrade makes users less worried and actually more willing to upgrade. This is true not least when Valkey is a part of a larger system which is upgraded as a whole. Valkey may be just one microservice of many within a larger system; not uncommon in on-prem deployments. If anything goes wrong (even if it's not Valkey itself) the user wants a way to roll back the whole system to the last working state.

This speaker in this talk presents a similar experience:
https://www.youtube.com/watch?v=IIElfMbE8cw

In the talk, it's mentioned that rolling back a system can even be part of compliance and SLA requirements.

@codecov
Copy link

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 89.65517% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.41%. Comparing base (8182f4a) to head (ae6567d).
⚠️ Report is 25 commits behind head on unstable.

Files with missing lines Patch % Lines
src/rdb.c 87.50% 4 Missing ⚠️
src/cluster_migrateslots.c 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2735      +/-   ##
============================================
- Coverage     72.58%   72.41%   -0.18%     
============================================
  Files           128      128              
  Lines         71277    70165    -1112     
============================================
- Hits          51736    50807     -929     
+ Misses        19541    19358     -183     
Files with missing lines Coverage Δ
src/aof.c 81.07% <100.00%> (-0.11%) ⬇️
src/cluster.c 90.22% <100.00%> (-0.03%) ⬇️
src/rdb.h 100.00% <ø> (ø)
src/replication.c 85.93% <100.00%> (-0.18%) ⬇️
src/server.h 100.00% <ø> (ø)
src/cluster_migrateslots.c 92.01% <50.00%> (-0.07%) ⬇️
src/rdb.c 76.69% <87.50%> (-0.43%) ⬇️

... and 100 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@madolson
Copy link
Member

Where a command like HSETEX is replicated to replica that doesn't understand it, the replica logs a warning and ignores it (the default behavior with the config 'propagation-error-behavior ignore').

Given that this is a config for replica behavior, maybe we should also have a config here? AWS has the propagation-error-behavior ignore config set to panic. AWS (and I) think consistency is extremely important, since having an inconsistent cache can cause data corruption being sent to users, even on a replica. Throwing an error in the log really doesn't do all that much, since most people don't have alarming set up on it. If there is data from the newer version, I think there should be an option to abort it.

If we were to add a config rdb-unsupported-feautre-behavior, I would vote to have the default be stricter. It could also be made more generic.

This speaker in this talk presents a similar experience: https://www.youtube.com/watch?v=IIElfMbE8cw

AWS also quietly launched a downgrade feature from 7.2 -> 7.1, and users have mentioned the most important benefit is that they feel a lot more comfortable to do the upgrade.

The code looks mostly OK, but I didn't dive too deeply yet. I know this was contentious in the past, so it would be good to get @enjoy-binbin @soloestoy @PingXie @hwware to chime in as well. I think we probably need to just make a decision here since it keeps coming up as important.

@zuiderkwast
Copy link
Contributor Author

If there is data from the newer version, I think there should be an option to abort it.

If we were to add a config rdb-unsupported-feautre-behavior, I would vote to have the default be stricter. It could also be made more generic.

Fine! Let's always abort in this case then? We don't need a config IMHO. What matters is that it works when no new features are used. If the user has started playing with HFE keys, they need to delete them and then it will work.

@madolson
Copy link
Member

Fine! Let's always abort in this case then? We don't need a config IMHO. What matters is that it works when no new features are used. If the user has started playing with HFE keys, they need to delete them and then it will work.

Yeah, I'm OK with this too.

@hwware
Copy link
Member

hwware commented Oct 15, 2025

I think I will vote as abstention, because in our side, we have no this option to our users, and we have no idea which kind of surprise could happen. But I think this feature is useful for some other users, thus I vote "abstention"

Only in full sync.

Use the replica's REPLCONF VERSION to select RDB version.

Abort if any data that can't be represented in the older RDB version is found.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, providing a way to undo an upgrade makes users less worried and actually more willing to upgrade. This is true not least when Valkey is a part of a larger system which is upgraded as a whole. Valkey may be just one microservice of many within a larger system; not uncommon in on-prem deployments. If anything goes wrong (even if it's not Valkey itself) the user wants a way to roll back the whole system to the last working state.

I agree with this, and this is indeed an important one, i am in it. The code mostly LGTM but i did not dive too deeply, just comment that i am ok with this

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast marked this pull request as ready for review October 27, 2025 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants