
The smaller config epoch primary will become the replica when two primaries in the same shard #2279


Open
wants to merge 8 commits into base: unstable

Conversation

@enjoy-binbin enjoy-binbin (Member) commented Jun 26, 2025

In #2261, there is a scenario that results in two primaries in a shard.

Let's say a shard has a primary(A) and a replica(B). Also let's assume that
cluster-allow-replica-migration is disabled.

  1. Node A goes down, and Node B takes over primaryship.
  2. Node A continues to be down while another Node C is added as a replica of B.
  3. Node B goes down, and Node C takes over primaryship.
  4. Node A and Node B come back up and start learning about the topology.
  5. Node A comes up thinking it was the primary (but has an older config epoch compared to C).
  6. Node A learns about Node C via gossip and assigns it a random shard_id.
  7. Node A receives a direct ping from Node C.
    a. Node C advertises the same set of slots that Node A was earlier owning.
    b. Since Node A assigns a random shard ID to Node C, Node A thinks that it is still a
    primary and it lost all its slots to Node C, which is in another shard.
  8. Node A then updates the actual shard_id of Node C while processing shard_id in ping extensions.
  9. Node A and Node C end up being primaries in the same shard while Node C continues to own slots.

So in step 8, Node A and Node C have the same shard_id and both report themselves
as primaries, but Node C has a bigger config epoch. In this case, we can try
to make Node A become a replica of Node C.

Fixes #2261.
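
For illustration, here is a minimal sketch of the decision rule above, using simplified stand-in types rather than the server's real clusterNode (the actual check added by this PR is quoted later in the thread):

#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the relevant clusterNode fields. */
typedef struct node {
    char shard_id[41];      /* hex shard id, NUL-terminated */
    int is_primary;         /* 1 if the node currently claims primaryship */
    uint64_t config_epoch;
} node;

/* Returns 1 if `myself` should reconfigure itself as a replica of `sender`:
 * both claim primaryship of the same shard, and the sender's config epoch
 * is greater, so the sender wins. */
static int should_demote_myself(const node *myself, const node *sender) {
    return myself->is_primary && sender->is_primary &&
           strcmp(myself->shard_id, sender->shard_id) == 0 &&
           sender->config_epoch > myself->config_epoch;
}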

An ASCII seq diagram:

+------+          +------+          +------+
|Node A|          |Node B|          |Node C|
+------+          +------+          +------+
   |                 |                 |
   | <Primary>       | <Replica>       |
   |                 |                 |
   X (fails)         |                 |
                     |---------------->|
                     |  (promoted to primary)
                     |                 |
                     X (fails)         |
                                       |
                                       |---------->
                                       | (promoted to primary)
   | (rejoins)                         |
   | (no extensions shared)            |
   |---------------------------------->|
   | (assigns random shard_id)         |
   |<----------------------------------|
   | (direct ping: slot info)          |
   | Updates C’s shard_id correctly    |
   | Detects slot conflict             |
   |                                  
   |                                  
+-----------------------------------------------+
| Outcome: Nodes A & C both see themselves as   |
| primary temporarily until shard_id is updated |
| and converged correctly based on epoch value  |
+-----------------------------------------------+

…maries in the same shard

Signed-off-by: Binbin <binloveplay1314@qq.com>
@hpatro hpatro (Collaborator) left a comment

Overall, the fix looks good to me.

Liked the GPT generated (slightly modified) ASCII seq diagram for this (the same one shown in the PR description above).

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin (Member Author)

I see the test "New replica inherits multiple migrating slots" is failing:

47461:M 28 Jun 2025 12:18:24.199 * Two primaries in same shard, and the sender has a greater config epoch. Reconfiguring myself as a replica of 8af67730a16d4a47a947271cca28a03ef9fdf337 ()
cluster_legacy.c:5995 'myself->numslots == 0' is not true

myself cluster nodes:

d2081b715fb4b174127ec7a99de4e180de345fbd 127.0.0.1:21120@31120,,tls-port=0,shard-id=9d3ebac222cc295191331bd3dbba06e607b42019 master - 0 1751084304199 3 connected 10923-16383
ee0d3a2618bf64993ce1e71569f845aa6e33f46b 127.0.0.1:21121@31121,,tls-port=0,shard-id=d3f24d198dac6233c89454ed22f1a3a85b1e0e9f master - 0 1751084304199 2 connected 5462-10922
8af67730a16d4a47a947271cca28a03ef9fdf337 127.0.0.1:21119@31119,,tls-port=0,shard-id=d5ada9160b2d0a5196dcdfe0325f6accf187dec0 master - 0 1751084304199 4 connected
21fe04bcaaf3eb6dfb6217e27b3100dccb0993e4 127.0.0.1:21118@31118,,tls-port=0,shard-id=d3f24d198dac6233c89454ed22f1a3a85b1e0e9f slave ee0d3a2618bf64993ce1e71569f845aa6e33f46b 0 1751084304199 2 connected
1497ec4b6d1ecdc419741306f8d1ae44d9e8fc9d 127.0.0.1:21117@31117,,tls-port=0,shard-id=9d3ebac222cc295191331bd3dbba06e607b42019 slave d2081b715fb4b174127ec7a99de4e180de345fbd 0 1751084304199 3 connected
7c0a9f9b0610f3a3c4d895282ed41f51f9f97a5d 127.0.0.1:21122@31122,,tls-port=0,shard-id=d5ada9160b2d0a5196dcdfe0325f6accf187dec0 myself,master - 0 0 1 connected 0-5461 [7->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b] [13->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b] [17->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b]

I will take a look later this day.

@enjoy-binbin enjoy-binbin (Member Author) commented Jun 28, 2025

The problem seems to be that a primary used to have a larger config epoch, then it became a replica and updated its shard_id, then it did a CLUSTER RESET SOFT and became an empty primary while keeping the larger config epoch and its shard_id, which causes this problem. It is the same issue, two primaries in the same shard, but here the empty primary has the larger config epoch.

See #2283. I think when a replica does a CLUSTER RESET SOFT, we should generate a new shard_id for it.
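
For illustration only, a minimal sketch of that suggestion, assuming the soft-reset path can simply regenerate the shard id; the type and the random_hex helper below are simplified stand-ins, not the actual #2283 patch:

#include <stdlib.h>

#define SHARD_ID_LEN 40

typedef struct node {
    int is_replica;
    char shard_id[SHARD_ID_LEN + 1];
} node;

/* Stand-in for the server's random hex id generator. */
static void random_hex(char *dst, int len) {
    static const char hex[] = "0123456789abcdef";
    for (int i = 0; i < len; i++) dst[i] = hex[rand() % 16];
    dst[len] = '\0';
}

static void cluster_reset_soft(node *myself) {
    if (myself->is_replica) {
        /* The replica turns into an empty primary; give it a fresh shard id
         * so it no longer claims membership in its old primary's shard. */
        random_hex(myself->shard_id, SHARD_ID_LEN);
        myself->is_replica = 0;
    }
    /* ... the rest of the existing soft-reset behavior stays unchanged ... */
}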

@enjoy-binbin enjoy-binbin marked this pull request as draft June 28, 2025 15:07
@zuiderkwast (Contributor)

then it did a CLUSTER RESET SOFT and became an empty primary while keeping the larger config epoch and its shard_id, which causes this problem

This test case doesn't use CLUSTER RESET SOFT. ❓

@zuiderkwast zuiderkwast (Contributor) left a comment

LGTM in general.

# it lost all its slots to Node C, which is in another shard.
# 8. Node A then updates the actual `shard_id` of Node C while processing `shard_id` in ping extensions.
# 9. Node A and Node C end up being primaries in the same shard while Node C continues to own slots.
start_cluster 3 2 {tags {external:skip cluster}} {
Contributor

We can put some config we want for all nodes here.

Suggested change
- start_cluster 3 2 {tags {external:skip cluster}} {
+ start_cluster 3 2 {tags {external:skip cluster} overrides {allow-replica-migration no}} {

@enjoy-binbin enjoy-binbin (Member Author) Jul 5, 2025

Yes, we can; here I am only explicitly setting it for R A (I will apply this change later).

@enjoy-binbin (Member Author)

This test case doesn't use CLUSTER RESET SOFT. ❓

That is the other test case; I handle it in #2283:

    test "New replica inherits multiple migrating slots" {
        # Reset R3 to turn it into an empty node
        assert_equal {OK} [R 3 CLUSTER RESET]
        # Add R3 back as a replica of R0
        assert_equal {OK} [R 3 CLUSTER MEET [srv 0 "host"] [srv 0 "port"]]
        wait_for_role 0 master
        assert_equal {OK} [R 3 CLUSTER REPLICATE $R0_id]
        wait_for_role 3 slave
        # Validate final states
        wait_for_slot_state 3 "\[7->-$R1_id\] \[13->-$R1_id\] \[17->-$R1_id\]"
    }

Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment on lines 3813 to 3819
/* If, after processing everything, we find that myself and the sender are in the
 * same shard and are both primaries, and myself's config epoch is smaller, make
 * myself a replica of the sender. */
if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) &&
    nodeEpoch(sender) > nodeEpoch(myself)) {
    clusterHandlePrimariesSameShardCollision(sender);
}
@enjoy-binbin (Member Author)

The new test added in #2283 currently fails here:

86095:M 05 Jul 2025 12:57:44.202 # === ASSERTION FAILED ===
86095:M 05 Jul 2025 12:57:44.202 # ==> cluster_legacy.c:6020 'myself->numslots == 0' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010227faac updateShardId + 0
1   valkey-server                       0x000000010227d12c clusterReadHandler + 7736
2   valkey-server                       0x00000001022f8040 connSocketEventHandler + 180
3   valkey-server                       0x00000001021bedd8 aeProcessEvents + 372
4   valkey-server                       0x00000001021e8f50 main + 26864
5   dyld                                0x0000000180b13154 start + 2476

That is because we dropped link->support_extension, see #2283 (comment).

The shard_id was not updated in time, which led to a misjudgment. So we either need to support link->support_extension or add a node->numslots check here.
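
For reference, this is roughly how the numslots guard alternative could look (a sketch of the option being discussed here, not necessarily what was merged):

/* Same check as in the excerpt above, with an extra myself->numslots guard so
 * that a primary still owning slots is never demoted on the basis of a
 * possibly stale shard_id. */
if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) &&
    nodeEpoch(sender) > nodeEpoch(myself) && myself->numslots == 0) {
    clusterHandlePrimariesSameShardCollision(sender);
}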

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 11, 2025
enjoy-binbin added a commit that referenced this pull request Jul 16, 2025
… faster during handshake (#2310)

Add support_extension to the cluster link, so that when myself receives a message from the sender,
even if myself does not know the sender (for example, during handshake the sender node is still
treated as a random node), the packet we reply to the sender with can also carry the extension,
since the sender link explicitly indicates that it supports the extension. By doing this, we can
speed up the spread of the extension.

Also see #2283 and #2279 for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
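
A minimal sketch of that idea, with simplified stand-in types rather than the actual clusterLink/clusterNode structs from #2310: the link remembers whether the peer announced extension support, and the reply path consults the link when the sender node is still unknown.

/* Simplified stand-ins for clusterLink / clusterNode. */
typedef struct link_sketch {
    int support_extension;   /* set when a received header announces extension support */
} link_sketch;

typedef struct node_sketch {
    int supports_extensions; /* known-node flag, normally learned via gossip */
} node_sketch;

/* Decide whether the reply sent on this link should carry ping extensions.
 * Even if `sender` is NULL or still in handshake (treated as a random node),
 * the link itself can tell us that the peer understands extensions. */
static int reply_with_extensions(const link_sketch *link, const node_sketch *sender) {
    if (sender && sender->supports_extensions) return 1;
    return link != NULL && link->support_extension;
}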

codecov bot commented Jul 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.46%. Comparing base (78e7208) to head (9c749e6).
Report is 7 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2279      +/-   ##
============================================
+ Coverage     71.42%   71.46%   +0.03%     
============================================
  Files           123      123              
  Lines         67135    67147      +12     
============================================
+ Hits          47952    47987      +35     
+ Misses        19183    19160      -23     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.94% <100.00%> (-0.16%) ⬇️

... and 23 files with indirect coverage changes


Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin marked this pull request as ready for review July 23, 2025 08:35
@enjoy-binbin enjoy-binbin requested a review from PingXie July 23, 2025 08:35
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

[BUG] Two primaries in same shard
4 participants