scheduler HA switches too early #1615

@andyxning

Suppose we have two schedulers, scheduler-master and scheduler-slave, and we want to test scheduler HA.

However, when we stop scheduler-master, the first time the dispatcher finds that scheduler-master cannot be reached, it clears all the managed_confs in satellitelink.update_managed_list. The test at https://github.com/naparuba/shinken/blob/master/shinken/dispatcher.py#L181 then evaluates to True, so on the very first failed ping the dispatcher redispatches the configuration originally assigned to scheduler-master, which is not the correct behavior.

I think we can take the handling of satellite HA as an example. Whether the scheduler configuration is redispatched should depend on which of these situations we are in:

  • the scheduler is down, thus sched.alive is False;
  • the scheduler is down but we have not yet reached the maximum number of retries, so we must retry again. In this situation sched.reachable is False and sched.do_i_manage(cfg_id, push_flavor) is False;
  • the scheduler went down and has now restarted, thus sched.reachable is True and sched.do_i_manage(cfg_id, push_flavor) is False;
  • the network to scheduler-master is down. If the network recovers before the maximum number of retries is reached, all is good and nothing needs to be done. If it is still down once the maximum is reached, we redispatch the configuration to scheduler-slave; when the network later recovers, check_bad_dispatch will make scheduler-master go back to wait_for_conf.
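The situations above can be summarized as a small decision function. This is only an illustrative sketch: `dispatch_action` and its string results are hypothetical names that do not exist in Shinken, and the three boolean inputs stand in for sched.alive, sched.reachable and the result of sched.do_i_manage(cfg_id, push_flavor).

```python
def dispatch_action(alive, reachable, manages_conf):
    """Return what the dispatcher should do for one scheduler link.

    Hypothetical helper, not part of Shinken. The inputs stand in for
    sched.alive, sched.reachable and sched.do_i_manage(cfg_id, push_flavor).
    """
    if not alive:
        # Max retries exhausted: the scheduler is considered dead.
        return "redispatch"
    if not reachable:
        # A ping failed but retries remain: wait and try again.
        return "retry"
    if not manages_conf:
        # Reachable (for example after a restart) but it lost its conf.
        return "redispatch"
    # Reachable and still managing its configuration.
    return "nothing"


if __name__ == "__main__":
    for flags in [(False, False, False), (True, False, False),
                  (True, True, False), (True, True, True)]:
        print(flags, "->", dispatch_action(*flags))
```

The network-outage situation needs no extra branch in this model: while retries remain it looks like the "retry" case, and once they are exhausted it looks like the dead-scheduler case.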

Thus, we should add one more condition to the test at line 181 of dispatcher.py:

if sched.reachable and not sched.do_i_manage(cfg_id, push_flavor):
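To show why the extra sched.reachable check matters, here is a toy model of a scheduler link. It is a sketch under assumptions, not Shinken's actual code: `ToySchedulerLink`, its `ping` method and the two `needs_redispatch_*` helpers are hypothetical, and the model only assumes, as described above, that a failed ping clears managed_confs and that the link is marked dead after a maximum number of attempts.

```python
class ToySchedulerLink:
    """Simplified stand-in for a scheduler link (hypothetical, not Shinken code)."""

    def __init__(self, max_check_attempts=3):
        self.alive = True
        self.reachable = True
        self.attempt = 0
        self.max_check_attempts = max_check_attempts
        self.managed_confs = {}  # cfg_id -> push_flavor

    def ping(self, ok):
        """Record the result of one ping attempt."""
        if ok:
            self.attempt = 0
            self.alive = True
            self.reachable = True
            return
        self.attempt += 1
        self.reachable = False
        # Models the behavior described in this issue: the managed
        # configurations are cleared as soon as one ping fails.
        self.managed_confs = {}
        if self.attempt >= self.max_check_attempts:
            self.alive = False

    def do_i_manage(self, cfg_id, push_flavor):
        return self.managed_confs.get(cfg_id) == push_flavor


def needs_redispatch_current(sched, cfg_id, push_flavor):
    # Test without the reachable guard: fires on the first failed ping.
    return not sched.do_i_manage(cfg_id, push_flavor)


def needs_redispatch_proposed(sched, cfg_id, push_flavor):
    # Proposed test: only redispatch if the scheduler answers
    # but no longer manages the configuration.
    return bool(sched.reachable and not sched.do_i_manage(cfg_id, push_flavor))


if __name__ == "__main__":
    sched = ToySchedulerLink()
    sched.managed_confs = {0: 12345}
    sched.ping(ok=False)  # first failed ping, retries remain
    print("current test: ", needs_redispatch_current(sched, 0, 12345))
    print("proposed test:", needs_redispatch_proposed(sched, 0, 12345))
```

In this model the unguarded test asks for a redispatch after a single failed ping, while the guarded test waits. The dead-scheduler case (sched.alive is False) is assumed to be handled by a separate check.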
