Description
Suppose we have two schedulers, `scheduler-master` and `scheduler-slave`, and we want to test scheduler HA.
However, when we stop `scheduler-master`, the first time the dispatcher finds that `scheduler-master` cannot be reached, it clears all of its `managed_confs` in `satellitelink.update_managed_list`. The test at https://github.com/naparuba/shinken/blob/master/shinken/dispatcher.py#L181 then evaluates to True, so on the very first failed ping of `scheduler-master` the dispatcher redispatches the configuration originally assigned to `scheduler-master`, which is not the correct behavior.
I think we can follow the way satellite HA is handled. Whether to redispatch a scheduler configuration should be decided according to these situations:
- The scheduler is down, so `sched.alive` is False.
- The scheduler is down but the max retry count has not been reached yet, so we must retry. In this situation `sched.reachable` is False and `sched.do_i_manage(cfg_id, push_flavor)` is False.
- The scheduler went down and has now restarted, so `sched.reachable` is True and `sched.do_i_manage(cfg_id, push_flavor)` is False.
- The network to `scheduler-master` is down. If the network recovers before the max retry count is reached, all is good and nothing needs to be done. If it has still not recovered by then, we redispatch the configuration to `scheduler-slave`; when the network comes back later, `check_bad_dispatch` will be used to make `scheduler-master` go to `wait_for_conf`.
Thus, we should add one test at line 181 of `dispatcher.py`:
if sched.reachable and not sched.do_i_manage(cfg_id, push_flavor):
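
To make the cases above concrete, here is a minimal sketch of the proposed condition. It does not use Shinken's real classes: `FakeSchedulerLink` and `needs_redispatch` are hypothetical names introduced only for illustration, and the way `managed_confs` is stored here (a dict of `cfg_id` to `push_flavor`) is an assumption. Only the attribute names `alive`, `reachable` and the method `do_i_manage(cfg_id, push_flavor)` are taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSchedulerLink:
    """Hypothetical stand-in for a scheduler link (not Shinken's class)."""
    alive: bool = True
    reachable: bool = True
    # Assumption: maps cfg_id -> push_flavor for the confs this link manages.
    managed_confs: dict = field(default_factory=dict)

    def do_i_manage(self, cfg_id, push_flavor):
        return self.managed_confs.get(cfg_id) == push_flavor


def needs_redispatch(sched, cfg_id, push_flavor):
    """Proposed rule: redispatch if the scheduler is declared dead, or if it
    is reachable but does not manage the conf we assigned to it."""
    if not sched.alive:
        return True
    return sched.reachable and not sched.do_i_manage(cfg_id, push_flavor)


CFG_ID, FLAVOR = 0, 1234

scenarios = {
    # Healthy scheduler that manages its conf.
    "healthy": FakeSchedulerLink(managed_confs={CFG_ID: FLAVOR}),
    # First failed ping: managed_confs already cleared, retries remain.
    "first_failed_ping": FakeSchedulerLink(reachable=False),
    # Max retries reached: the scheduler is declared dead.
    "dead": FakeSchedulerLink(alive=False, reachable=False),
    # Scheduler restarted: reachable again, but it lost its conf.
    "restarted": FakeSchedulerLink(),
}

results = {name: needs_redispatch(s, CFG_ID, FLAVOR)
           for name, s in scenarios.items()}

for name, decision in results.items():
    print(name, "-> redispatch" if decision else "-> keep")
```

With the extra `sched.reachable` check, the first failed ping no longer triggers a redispatch on its own; the configuration only moves once the scheduler is declared dead or comes back without its configuration.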