Skip to content
This repository was archived by the owner on Jan 9, 2024. It is now read-only.
This repository was archived by the owner on Jan 9, 2024. It is now read-only.

rediscluster.cluster_failover() does not connect to the proper node #360

Open
@psd314

Description

@psd314

Description

I have a 9 node cluster with 3 masters, each master on its own machine. The node is cross-replicated so the 2 slaves on each machine are replicas of the other 2 non-local masters. I'm working on re-balancing the cluster by manually failing over a master after a machine has gone down and comes back up where two masters end up on one machine.

import rediscluster

REDIS_AUTH_PASS='authPassword'

startup_nodes = [{'host': 'XXX.XX.43.124', 'port': 6379}]
rc = rediscluster.RedisCluster(
            startup_nodes=startup_nodes,
            decode_responses=True,
            password=REDIS_AUTH_PASS)
node_id = 'slave_node_id_to_failover'
rc.cluster_failover(node_id, 'FORCE')

Expected

That the targeted slave node should be promoted to master and the associated master be demoted to a slave. I can do this successfully on this cluster via redis-cli -h <host> -p <port> -a <password> CLUSTER FAILOVER FORCE.

Actual

Traceback (most recent call last):
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/bin/ark", line 11, in <module>
    load_entry_point('arkcli', 'console_scripts', 'ark')()
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/api.py", line 72, in invoke
    super().invoke(ctx)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/api.py", line 72, in invoke
    super().invoke(ctx)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/redis.py", line 121, in rebalance
    rc.cluster_failover(node['id'], 'FORCE')
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/client.py", line 539, in cluster_failover
    return self.execute_command('CLUSTER FAILOVER', Token(option))
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/utils.py", line 101, in inner
    return func(*args, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/client.py", line 410, in execute_command
    return self.parse_response(r, command, **kwargs)
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/redis/client.py", line 768, in parse_response
    response = connection.read_response()
  File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/redis/connection.py", line 638, in read_response
    raise response
redis.exceptions.ResponseError: You should send CLUSTER FAILOVER to a replica

Bug

The node_id argument in the cluster_failover method does not get passed on to execute_command and is never used to identify the proper node to connect to. From rediscluster/client.py:

    def cluster_failover(self, node_id, option):
        """
        Forces a slave to perform a manual failover of its master

        Sends to specefied node
        """
        assert option.upper() in ('FORCE', 'TAKEOVER')  # TODO: change this option handling
        return self.execute_command('CLUSTER FAILOVER', Token(option))

It looks like the option argument gets used as key to identify the node via slot lookup and the master node for that slot gets returned. Also from rediscluster/client.py:

    @clusterdown_wrapper
    def execute_command(self, *args, **kwargs):
        """
        Send a command to a node in the cluster
        """
        if not args:
            raise RedisClusterException("Unable to determine command to use")

        command = args[0]

        # If set externally we must update it before calling any commands
        if self.refresh_table_asap:
            self.connection_pool.nodes.initialize()
            self.refresh_table_asap = False

        node = self.determine_node(*args, **kwargs)
        if node:
            return self._execute_command_on_nodes(node, *args, **kwargs)

        redirect_addr = None
        asking = False
        is_read_replica = False

        try_random_node = False

arg[1], the option arg in cluster_failover, gets used to determine slot in self._determine_slot(*args)

        slot = self._determine_slot(*args)
        ttl = int(self.RedisClusterRequestTTL)

        while ttl > 0:
            ttl -= 1

            if asking:
                node = self.connection_pool.nodes.nodes[redirect_addr]
                r = self.connection_pool.get_connection_by_node(node)
            elif try_random_node:
                r = self.connection_pool.get_random_connection()
                try_random_node = False
            else:
                if self.refresh_table_asap:
                    # MOVED
                    node = self.connection_pool.get_master_node_by_slot(slot)
                else:

slot gets used to look up node to failover, which returns the master node where the slot is located rather than the node_id passed to cluster_failover

                    node = self.connection_pool.get_node_by_slot(slot, self.read_from_replicas and (command in self.READ_COMMANDS))
                    is_read_replica = node['server_type'] == 'slave'
                r = self.connection_pool.get_connection_by_node(node)

I was able to work around this and get the slave to initiate the failover by finding the target node in rc.connection_pool.nodes.all_nodes() and use it to establish the connection to that node and send the CLUSTER FAILOVER command.

>>> node = {'host': 'XXX.XX.43.124', 'port': 6379, 'name': 'XXX.XX.43.124:6379', 'server_type': 'slave'}
>>> connection = rc.connection_pool.get_connection_by_node(node)
>>> connection.send_command('CLUSTER FAILOVER', 'FORCE')
>>> resp = rc.parse_response(connection, 'CLUSTER FAILOVER')
>>> resp
True

Tasks

Allow cluster_failover to target a slave node by id to start a manual failover.

Metadata

Metadata

Assignees

No one assigned

    Labels

    3.0.0All issues that will be looked at for 3.0.0 releaseAcceptedAccepted bug or enhancementAccepting PRtype: bug

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions