
The smaller config epoch primary will become the replica when two primaries in the same shard #2279


Open
wants to merge 8 commits into base: unstable

Conversation

@enjoy-binbin enjoy-binbin (Member) commented Jun 26, 2025

In #2261, there is a scenario that results in two primaries in a shard.

Let's say a shard has a primary(A) and a replica(B). Also let's assume that
cluster-allow-replica-migration is disabled.

  1. Node A goes down, and Node B takes over primaryship.
  2. Node A continues to be down while another Node C is added as a replica of B.
  3. Node B goes down, and Node C takes over primaryship.
  4. Node A and Node B come back up and start learning about the topology.
  5. Node A comes up thinking it was the primary (but has an older config epoch compared to C).
  6. Node A learns about Node C via gossip and assigns it a random shard_id.
  7. Node A receives a direct ping from Node C.
    a. Node C advertises the same set of slots that Node A was earlier owning.
    b. Since Node A assigns a random shard ID to Node C, Node A thinks that it is still a
    primary and it lost all its slots to Node C, which is in another shard.
  8. Node A then updates the actual shard_id of Node C while processing shard_id in ping extensions.
  9. Node A and Node C end up being primaries in the same shard while Node C continues to own slots.

So in step 8, Node A and Node C have the same shard_id and both report themselves
as primaries, but Node C has a bigger config epoch. In this case, we can try
to make Node A become a replica of Node C.

Fixes #2261.
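
For illustration, here is a minimal sketch of the decision rule above, using simplified stand-in types rather than the server's real clusterNode (the actual check added by this PR is quoted later in the thread):

#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the relevant clusterNode fields. */
typedef struct node {
    char shard_id[41];      /* hex shard id, NUL-terminated */
    int is_primary;         /* 1 if the node currently claims primaryship */
    uint64_t config_epoch;
} node;

/* Returns 1 if `myself` should reconfigure itself as a replica of `sender`:
 * both claim primaryship of the same shard, and the sender's config epoch
 * is greater, so the sender wins. */
static int should_demote_myself(const node *myself, const node *sender) {
    return myself->is_primary && sender->is_primary &&
           strcmp(myself->shard_id, sender->shard_id) == 0 &&
           sender->config_epoch > myself->config_epoch;
}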

An ASCII seq diagram:

+------+          +------+          +------+
|Node A|          |Node B|          |Node C|
+------+          +------+          +------+
   |                 |                 |
   | <Primary>       | <Replica>       |
   |                 |                 |
   X (fails)         |                 |
                     |---------------->|
                     |  (promoted to primary)
                     |                 |
                     X (fails)         |
                                       |
                                       |---------->
                                       | (promoted to primary)
   | (rejoins)                         |
   | (no extensions shared)            |
   |---------------------------------->|
   | (assigns random shard_id)         |
   |<----------------------------------|
   | (direct ping: slot info)          |
   | Updates C’s shard_id correctly    |
   | Detects slot conflict             |
   |                                  
   |                                  
+-----------------------------------------------+
| Outcome: Nodes A & C both see themselves as   |
| primary temporarily until shard_id is updated |
| and converged correctly based on epoch value  |
+-----------------------------------------------+

…maries in the same shard

Signed-off-by: Binbin <binloveplay1314@qq.com>
@hpatro hpatro (Collaborator) left a comment

Overall, the fix looks good to me.

Liked the GPT generated (slightly modified) ASCII seq diagram for this (the same one shown in the PR description above).

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin (Member Author)

I see the test "New replica inherits multiple migrating slots" is failing:

47461:M 28 Jun 2025 12:18:24.199 * Two primaries in same shard, and the sender has a greater config epoch. Reconfiguring myself as a replica of 8af67730a16d4a47a947271cca28a03ef9fdf337 ()
cluster_legacy.c:5995 'myself->numslots == 0' is not true

myself cluster nodes:

d2081b715fb4b174127ec7a99de4e180de345fbd 127.0.0.1:21120@31120,,tls-port=0,shard-id=9d3ebac222cc295191331bd3dbba06e607b42019 master - 0 1751084304199 3 connected 10923-16383
ee0d3a2618bf64993ce1e71569f845aa6e33f46b 127.0.0.1:21121@31121,,tls-port=0,shard-id=d3f24d198dac6233c89454ed22f1a3a85b1e0e9f master - 0 1751084304199 2 connected 5462-10922
8af67730a16d4a47a947271cca28a03ef9fdf337 127.0.0.1:21119@31119,,tls-port=0,shard-id=d5ada9160b2d0a5196dcdfe0325f6accf187dec0 master - 0 1751084304199 4 connected
21fe04bcaaf3eb6dfb6217e27b3100dccb0993e4 127.0.0.1:21118@31118,,tls-port=0,shard-id=d3f24d198dac6233c89454ed22f1a3a85b1e0e9f slave ee0d3a2618bf64993ce1e71569f845aa6e33f46b 0 1751084304199 2 connected
1497ec4b6d1ecdc419741306f8d1ae44d9e8fc9d 127.0.0.1:21117@31117,,tls-port=0,shard-id=9d3ebac222cc295191331bd3dbba06e607b42019 slave d2081b715fb4b174127ec7a99de4e180de345fbd 0 1751084304199 3 connected
7c0a9f9b0610f3a3c4d895282ed41f51f9f97a5d 127.0.0.1:21122@31122,,tls-port=0,shard-id=d5ada9160b2d0a5196dcdfe0325f6accf187dec0 myself,master - 0 0 1 connected 0-5461 [7->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b] [13->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b] [17->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b]

I will take a look later this day.

@enjoy-binbin enjoy-binbin (Member Author) commented Jun 28, 2025

The problem seems to be that a primary used to have a larger config epoch, then it became a replica and updated its shard_id, then it did a CLUSTER RESET SOFT and became an empty primary while keeping the larger config epoch and its shard_id, which causes this problem. It is the same issue, two primaries in the same shard, but here the empty primary has the larger config epoch.

See #2283. I think when a replica does a CLUSTER RESET SOFT, we should generate a new shard_id for it.
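
For illustration only, a minimal sketch of that suggestion, assuming the soft-reset path can simply regenerate the shard id; the type and the random_hex helper below are simplified stand-ins, not the actual #2283 patch:

#include <stdlib.h>

#define SHARD_ID_LEN 40

typedef struct node {
    int is_replica;
    char shard_id[SHARD_ID_LEN + 1];
} node;

/* Stand-in for the server's random hex id generator. */
static void random_hex(char *dst, int len) {
    static const char hex[] = "0123456789abcdef";
    for (int i = 0; i < len; i++) dst[i] = hex[rand() % 16];
    dst[len] = '\0';
}

static void cluster_reset_soft(node *myself) {
    if (myself->is_replica) {
        /* The replica turns into an empty primary; give it a fresh shard id
         * so it no longer claims membership in its old primary's shard. */
        random_hex(myself->shard_id, SHARD_ID_LEN);
        myself->is_replica = 0;
    }
    /* ... the rest of the existing soft-reset behavior stays unchanged ... */
}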

@enjoy-binbin enjoy-binbin marked this pull request as draft June 28, 2025 15:07
@zuiderkwast (Contributor)

then it did a CLUSTER RESET SOFT and became an empty primary while keeping the larger config epoch and its shard_id, which causes this problem

This test case doesn't use CLUSTER RESET SOFT. ❓

@zuiderkwast zuiderkwast (Contributor) left a comment

LGTM in general.

# it lost all its slots to Node C, which is in another shard.
# 8. Node A then updates the actual `shard_id` of Node C while processing `shard_id` in ping extensions.
# 9. Node A and Node C end up being primaries in the same shard while Node C continues to own slots.
start_cluster 3 2 {tags {external:skip cluster}} {
Contributor

We can put some config we want for all nodes here.

Suggested change
- start_cluster 3 2 {tags {external:skip cluster}} {
+ start_cluster 3 2 {tags {external:skip cluster} overrides {allow-replica-migration no}} {

@enjoy-binbin enjoy-binbin (Member Author) Jul 5, 2025

Yes, we can; here I am only explicitly setting it for R A (I will apply this change later).

@enjoy-binbin (Member Author)

This test case doesn't use CLUSTER RESET SOFT. ❓

That is the other test case; I handle it in #2283:

    test "New replica inherits multiple migrating slots" {
        # Reset R3 to turn it into an empty node
        assert_equal {OK} [R 3 CLUSTER RESET]
        # Add R3 back as a replica of R0
        assert_equal {OK} [R 3 CLUSTER MEET [srv 0 "host"] [srv 0 "port"]]
        wait_for_role 0 master
        assert_equal {OK} [R 3 CLUSTER REPLICATE $R0_id]
        wait_for_role 3 slave
        # Validate final states
        wait_for_slot_state 3 "\[7->-$R1_id\] \[13->-$R1_id\] \[17->-$R1_id\]"
    }

Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment on lines 3813 to 3819
/* If, after processing everything, we find that myself and the sender are in the
 * same shard and are both primaries, and myself's config epoch is smaller, make
 * myself a replica of the sender. */
if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) &&
    nodeEpoch(sender) > nodeEpoch(myself)) {
    clusterHandlePrimariesSameShardCollision(sender);
}
@enjoy-binbin (Member Author)

The new test added in #2283 currently fails here:

86095:M 05 Jul 2025 12:57:44.202 # === ASSERTION FAILED ===
86095:M 05 Jul 2025 12:57:44.202 # ==> cluster_legacy.c:6020 'myself->numslots == 0' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010227faac updateShardId + 0
1   valkey-server                       0x000000010227d12c clusterReadHandler + 7736
2   valkey-server                       0x00000001022f8040 connSocketEventHandler + 180
3   valkey-server                       0x00000001021bedd8 aeProcessEvents + 372
4   valkey-server                       0x00000001021e8f50 main + 26864
5   dyld                                0x0000000180b13154 start + 2476

That is because we dropped link->support_extension, see #2283 (comment).

The shard_id was not updated in time, which led to a misjudgment. So we either need to support link->support_extension or add a node->numslots check here.
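
For reference, this is roughly how the numslots guard alternative could look (a sketch of the option being discussed here, not necessarily what was merged):

/* Same check as in the excerpt above, with an extra myself->numslots guard so
 * that a primary still owning slots is never demoted on the basis of a
 * possibly stale shard_id. */
if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) &&
    nodeEpoch(sender) > nodeEpoch(myself) && myself->numslots == 0) {
    clusterHandlePrimariesSameShardCollision(sender);
}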

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 11, 2025
enjoy-binbin added a commit that referenced this pull request Jul 16, 2025
… faster during handshake (#2310)

Add support_extension to the cluster link, so that when myself receives a message from the sender,
even if myself does not know the sender (for example, during handshake the sender node is still
treated as a random node), the packet we reply to the sender with can also carry the extension,
since the sender link explicitly indicates that it supports the extension. By doing this, we can
speed up the spread of the extension.

Also see #2283 and #2279 for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
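
A minimal sketch of that idea, with simplified stand-in types rather than the actual clusterLink/clusterNode structs from #2310: the link remembers whether the peer announced extension support, and the reply path consults the link when the sender node is still unknown.

/* Simplified stand-ins for clusterLink / clusterNode. */
typedef struct link_sketch {
    int support_extension;   /* set when a received header announces extension support */
} link_sketch;

typedef struct node_sketch {
    int supports_extensions; /* known-node flag, normally learned via gossip */
} node_sketch;

/* Decide whether the reply sent on this link should carry ping extensions.
 * Even if `sender` is NULL or still in handshake (treated as a random node),
 * the link itself can tell us that the peer understands extensions. */
static int reply_with_extensions(const link_sketch *link, const node_sketch *sender) {
    if (sender && sender->supports_extensions) return 1;
    return link != NULL && link->support_extension;
}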

codecov bot commented Jul 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.46%. Comparing base (78e7208) to head (9c749e6).
Report is 7 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2279      +/-   ##
============================================
+ Coverage     71.42%   71.46%   +0.03%     
============================================
  Files           123      123              
  Lines         67135    67147      +12     
============================================
+ Hits          47952    47987      +35     
+ Misses        19183    19160      -23     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.94% <100.00%> (-0.16%) ⬇️

... and 23 files with indirect coverage changes


Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin marked this pull request as ready for review July 23, 2025 08:35
@enjoy-binbin enjoy-binbin requested a review from PingXie July 23, 2025 08:35
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

[BUG] Two primaries in same shard
4 participants