
[BUG] Two primaries in same shard #2261

@deepakrn

Describe the bug

A shard can end up with two primaries when cluster-allow-replica-migration is disabled.

To reproduce

Let’s say a shard has a primary (A) and a replica (B), and assume that cluster-allow-replica-migration is disabled.

  1. Node A goes down, and Node B takes over primaryship.
  2. Node A continues to be down while another Node C is added as a replica of B.
  3. Node B goes down, and Node C takes over primaryship.
  4. Node A and Node B come back up and start learning about the topology.
  5. Node A comes up thinking it was the primary (but has an older config epoch compared to C).
  6. Node A learns about Node C via gossip and assigns it a random shard_id.
  7. Node A receives a direct ping from Node C.
    a. Node C advertises the same set of slots that Node A was earlier owning.
    b. Since Node A assigned a random shard_id to Node C, Node A concludes that it is still a primary that has lost all its slots to Node C, which it believes is in another shard.
  8. Node A then updates the actual shard_id of Node C while processing shard_id in ping extensions.
  9. Node A and Node C end up being primaries in the same shard while Node C continues to own slots (a minimal sketch of this processing order follows the list).
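
The sequence above hinges on the order in which node A processes the incoming ping. Below is a minimal, self-contained C sketch of node A's view during steps 7–9. The type and function names (view_node, process_slot_claims, process_shard_id_ext) are illustrative stand-ins, not the actual cluster.c code, and slot/shard handling is reduced to the single comparison that matters for this bug.

```c
/* Illustrative model only: view_node, process_slot_claims and
 * process_shard_id_ext are stand-ins, not the real cluster.c code. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

struct view_node {
    char name[8];
    char shard_id[16];   /* shard id as recorded in node A's view */
    bool is_primary;
    int  owned_slots;    /* simplified: a slot count instead of a bitmap */
};

/* Step 7: A applies C's slot claims. Because A still holds the random
 * shard id it assigned to C at gossip time (step 6), it treats C as a
 * foreign shard: it gives up the slots but does not demote itself. */
static void process_slot_claims(struct view_node *self, struct view_node *sender) {
    if (strcmp(self->shard_id, sender->shard_id) != 0) {
        self->owned_slots = 0;     /* slots "lost" to another shard */
    } else {
        self->is_primary = false;  /* same shard: step down to a replica */
        self->owned_slots = 0;
    }
}

/* Step 8: only now does A learn C's real shard id from the ping extension. */
static void process_shard_id_ext(struct view_node *sender, const char *real_shard) {
    strncpy(sender->shard_id, real_shard, sizeof(sender->shard_id) - 1);
}

int main(void) {
    struct view_node a = {"A", "shard-1", true, 100};
    struct view_node c = {"C", "rand-3f", true, 100};  /* random id from gossip */

    process_slot_claims(&a, &c);          /* current order: slots first ...  */
    process_shard_id_ext(&c, "shard-1");  /* ... shard_id extension second   */

    printf("A: primary=%d shard=%s | C: primary=%d shard=%s\n",
           a.is_primary, a.shard_id, c.is_primary, c.shard_id);
    /* Prints two primaries in shard-1: the end state from step 9. */
    return 0;
}
```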

Expected behavior

I would have expected Node A to become a replica of Node C after learning that Node C is in the same shard.

Potential fix

Currently, when Node A receives a ping from Node C, it first processes the slot configuration and only then processes the shard_id in the ping extension. One way to fix this could be to process the shard_id (or all of the ping extensions) first and update the slot configuration afterwards.
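
Here is a minimal sketch of what that reordering would look like, reusing the same illustrative model as above (again, these are not the real cluster.c functions): once the shard_id extension is applied before the slot claims, Node A sees that Node C is in its own shard and steps down to a replica, matching the expected behavior.

```c
/* Same toy model, with the proposed ordering. Illustrative only. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

struct view_node { char shard_id[16]; bool is_primary; int owned_slots; };

static void process_slot_claims(struct view_node *self, struct view_node *sender) {
    if (strcmp(self->shard_id, sender->shard_id) == 0)
        self->is_primary = false;  /* same shard and slots claimed: become a replica */
    self->owned_slots = 0;
}

int main(void) {
    struct view_node a = {"shard-1", true, 100};
    struct view_node c = {"rand-3f", true, 100};   /* random id from gossip */

    /* Proposed order: apply the shard_id ping extension first ... */
    strncpy(c.shard_id, "shard-1", sizeof(c.shard_id) - 1);
    /* ... then evaluate the slot claims against C's real shard id. */
    process_slot_claims(&a, &c);

    printf("A: primary=%d | C: primary=%d\n", a.is_primary, c.is_primary);
    /* A steps down, leaving C as the single primary of the shard. */
    return 0;
}
```

An actual fix would also need to handle peers that do not send the shard_id extension (e.g. older versions), so the reordering above is only a sketch of the direction, not a complete patch.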
