rediscluster.cluster_failover() does not connect to the proper node #360
Description
Description
I have a 9 node cluster with 3 masters, each master on its own machine. The node is cross-replicated so the 2 slaves on each machine are replicas of the other 2 non-local masters. I'm working on re-balancing the cluster by manually failing over a master after a machine has gone down and comes back up where two masters end up on one machine.
import rediscluster
REDIS_AUTH_PASS='authPassword'
startup_nodes = [{'host': 'XXX.XX.43.124', 'port': 6379}]
rc = rediscluster.RedisCluster(
startup_nodes=startup_nodes,
decode_responses=True,
password=REDIS_AUTH_PASS)
node_id = 'slave_node_id_to_failover'
rc.cluster_failover(node_id, 'FORCE')
Expected
That the targeted slave node should be promoted to master and the associated master be demoted to a slave. I can do this successfully on this cluster via redis-cli -h <host> -p <port> -a <password> CLUSTER FAILOVER FORCE
.
Actual
Traceback (most recent call last):
File "/ws/phidixon-rtp/.virtualenvs/arkcli/bin/ark", line 11, in <module>
load_entry_point('arkcli', 'console_scripts', 'ark')()
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/api.py", line 72, in invoke
super().invoke(ctx)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/api.py", line 72, in invoke
super().invoke(ctx)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/ws/phidixon-rtp/ark_project/arkcli/src/arkcli/redis.py", line 121, in rebalance
rc.cluster_failover(node['id'], 'FORCE')
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/client.py", line 539, in cluster_failover
return self.execute_command('CLUSTER FAILOVER', Token(option))
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/utils.py", line 101, in inner
return func(*args, **kwargs)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/rediscluster/client.py", line 410, in execute_command
return self.parse_response(r, command, **kwargs)
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/redis/client.py", line 768, in parse_response
response = connection.read_response()
File "/ws/phidixon-rtp/.virtualenvs/arkcli/lib/python3.7/site-packages/redis/connection.py", line 638, in read_response
raise response
redis.exceptions.ResponseError: You should send CLUSTER FAILOVER to a replica
Bug
The node_id
argument in the cluster_failover
method does not get passed on to execute_command
and is never used to identify the proper node to connect to. From rediscluster/client.py:
def cluster_failover(self, node_id, option):
"""
Forces a slave to perform a manual failover of its master
Sends to specefied node
"""
assert option.upper() in ('FORCE', 'TAKEOVER') # TODO: change this option handling
return self.execute_command('CLUSTER FAILOVER', Token(option))
It looks like the option
argument gets used as key to identify the node via slot lookup and the master node for that slot gets returned. Also from rediscluster/client.py:
@clusterdown_wrapper
def execute_command(self, *args, **kwargs):
"""
Send a command to a node in the cluster
"""
if not args:
raise RedisClusterException("Unable to determine command to use")
command = args[0]
# If set externally we must update it before calling any commands
if self.refresh_table_asap:
self.connection_pool.nodes.initialize()
self.refresh_table_asap = False
node = self.determine_node(*args, **kwargs)
if node:
return self._execute_command_on_nodes(node, *args, **kwargs)
redirect_addr = None
asking = False
is_read_replica = False
try_random_node = False
arg[1], the option
arg in cluster_failover
, gets used to determine slot in self._determine_slot(*args)
slot = self._determine_slot(*args)
ttl = int(self.RedisClusterRequestTTL)
while ttl > 0:
ttl -= 1
if asking:
node = self.connection_pool.nodes.nodes[redirect_addr]
r = self.connection_pool.get_connection_by_node(node)
elif try_random_node:
r = self.connection_pool.get_random_connection()
try_random_node = False
else:
if self.refresh_table_asap:
# MOVED
node = self.connection_pool.get_master_node_by_slot(slot)
else:
slot
gets used to look up node to failover, which returns the master node where the slot is located rather than the node_id
passed to cluster_failover
node = self.connection_pool.get_node_by_slot(slot, self.read_from_replicas and (command in self.READ_COMMANDS))
is_read_replica = node['server_type'] == 'slave'
r = self.connection_pool.get_connection_by_node(node)
I was able to work around this and get the slave to initiate the failover by finding the target node in rc.connection_pool.nodes.all_nodes()
and use it to establish the connection to that node and send the CLUSTER FAILOVER
command.
>>> node = {'host': 'XXX.XX.43.124', 'port': 6379, 'name': 'XXX.XX.43.124:6379', 'server_type': 'slave'}
>>> connection = rc.connection_pool.get_connection_by_node(node)
>>> connection.send_command('CLUSTER FAILOVER', 'FORCE')
>>> resp = rc.parse_response(connection, 'CLUSTER FAILOVER')
>>> resp
True
Tasks
Allow cluster_failover
to target a slave node by id to start a manual failover.