intelmqctl stop bots are still running #2595

Open
Lukas-Heindl opened this issue Apr 9, 2025 · 7 comments · May be fixed by #2598
Labels
component: intelmqctl, good first issue

Comments

@Lukas-Heindl

Hi,

Working more with intelmq these days, I noticed that when executing intelmqctl stop, some bots are sometimes still reported as running in the output afterwards (not that big of an issue) and the exit code is != 0 (bigger issue, since my wrapper script (using systemd for restarting and, most importantly, for starting when booting the server) reacts to this).

I noticed that when running intelmqctl status after the intelmqctl stop, the bots actually are reported as stopped. Looking deeper into the code responsible for stopping the bots, I noticed that intelmq(ctl) uses the following procedure for stopping the whole botnet:

  1. Iterate over the bots in the botnet:
         for bot_id in bots:
             self.bot_stop(bot_id, getstatus=False)
  2. Send them the SIGTERM signal:
         proc = psutil.Process(int(pid))
         try:
             proc.send_signal(signal.SIGTERM)
  3. Wait for 0.75 seconds:
         time.sleep(0.75)
  4. Check whether the bots are still running (this determines the exit code):
         for bot_id in bots:
             botnet_status[bot_id] = self.bot_status(bot_id)[1]
             if botnet_status[bot_id] not in ['stopped', 'disabled']:
                 retval = 1
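
The four steps above can be condensed into a small simulation that shows why the exit code can end up non-zero even though the bots do stop eventually. This is only a sketch under assumptions: `stop_botnet_fixed_delay` and `FakeBotnet` are hypothetical names that are not part of intelmq, and a simulated clock stands in for real processes and signals.

```python
import time


def stop_botnet_fixed_delay(bots, send_stop, get_status, delay=0.75, sleep=time.sleep):
    """Signal every bot, wait once, then check every bot exactly once."""
    for bot_id in bots:
        send_stop(bot_id)
    sleep(delay)
    retval = 0
    botnet_status = {}
    for bot_id in bots:
        botnet_status[bot_id] = get_status(bot_id)
        if botnet_status[bot_id] not in ('stopped', 'disabled'):
            retval = 1
    return retval, botnet_status


class FakeBotnet:
    """Hypothetical stand-in for real bots: each bot needs a fixed time to exit."""

    def __init__(self, shutdown_times):
        self.shutdown_times = shutdown_times  # bot_id -> seconds needed to exit
        self.now = 0.0                        # simulated clock
        self.stop_requested_at = {}

    def send_stop(self, bot_id):
        self.stop_requested_at[bot_id] = self.now

    def sleep(self, seconds):
        self.now += seconds  # advance the simulated clock instead of really waiting

    def get_status(self, bot_id):
        requested = self.stop_requested_at.get(bot_id)
        if requested is None:
            return 'running'
        elapsed = self.now - requested
        return 'stopped' if elapsed >= self.shutdown_times[bot_id] else 'running'


botnet = FakeBotnet({'fast-bot': 0.2, 'slow-bot': 1.0})
retval, status = stop_botnet_fixed_delay(
    list(botnet.shutdown_times), botnet.send_stop, botnet.get_status,
    sleep=botnet.sleep)
print(retval, status)  # 1 {'fast-bot': 'stopped', 'slow-bot': 'running'}

# A later status check (like running intelmqctl status by hand) sees the bot stopped.
botnet.sleep(0.5)
print(botnet.get_status('slow-bot'))  # stopped
```

The slow bot needs 1.0 simulated seconds but is checked after 0.75, so the single check reports it as running and sets the return value to 1.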

So to me it looks like, on our server, it takes longer than that delay until all the bots are finally stopped (when executing intelmqctl status afterwards, the bots are stopped after all). In our case we're speaking about 16 bots on a server with 4 GiB RAM and 2 cores (not that impressive specs, but so far we're not dealing with massive amounts of data and half of the bots are really just for testing purposes).

With this in mind, does my analysis make sense to you (as people knowing intelmq much better than I do)?

So far my approach would be simply increasing the time intelmqctl stop sleeps until checking on the bots (not generally, but adding this as a parameter to the CLI). Am I missing a simpler solution here?

@sebix
Member

sebix commented Apr 9, 2025

short remark on:

> bigger issue, since my wrapper script (using systemd for restarting and, most importantly, for starting when booting the server) reacts to this

You can also run the bots as systemd services directly: https://github.com/certtools/intelmq/tree/develop/contrib/systemd

I'll look into your report in more detail tomorrow

@Lukas-Heindl
Author

> short remark on:
>
> > bigger issue, since my wrapper script (using systemd for restarting and, most importantly, for starting when booting the server) reacts to this
>
> You can also run the bots as systemd services directly: https://github.com/certtools/intelmq/tree/develop/contrib/systemd

Thought about it, but I wanted to avoid having to keep systemd and intelmq bots in sync (just one more step/layer on top). So for us, systemd currently only ensures that intelmq runs when the server is rebooted (we try to catch the case where a bot crashes and needs to be restarted by monitoring the logs).

@kamil-certat
Contributor

Oh, this may explain some issues I sometimes see. 🤔 I'd actually suggest not "just" increasing the hardcoded time, but rebuilding it a little to have x retries with a shorter sleep between them, ideally checking only the outstanding bots in every retry.

@Lukas-Heindl
Author

My intention was to simply make the time configurable (not only changing the hardcoded time). This obviously is the easy fix.

I hadn't thought about multiple retries yet (they would reduce the latency when a larger sleep time is configured but not actually needed). I'm not sure whether checking more often is a good idea: I don't have a feeling for how many bots other users usually have running, and checking on a large number of bots frequently might be something we want to avoid. Maybe some kind of exponential backoff, combined with not re-checking bots we already know have stopped, helps here.
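
To make the trade-off concrete, here is a minimal sketch of an exponential backoff schedule next to a fixed-interval one. The function name `backoff_delays` and all parameter values are assumptions chosen for illustration; nothing here is taken from the intelmq code base.

```python
def backoff_delays(initial=0.25, factor=2.0, max_total=4.0):
    """Yield sleep durations that grow by `factor` and sum to exactly `max_total`."""
    if initial <= 0 or factor < 1 or max_total <= 0:
        raise ValueError("initial and max_total must be > 0 and factor must be >= 1")
    waited = 0.0
    delay = initial
    while waited < max_total:
        # Clip the last step so the total waiting time never exceeds max_total.
        step = min(delay, max_total - waited)
        yield step
        waited += step
        delay *= factor


exponential = list(backoff_delays(initial=0.25, factor=2.0, max_total=4.0))
fixed = list(backoff_delays(initial=0.25, factor=1.0, max_total=4.0))
print(exponential)                   # [0.25, 0.5, 1.0, 2.0, 0.25]
print(len(exponential), len(fixed))  # 5 16
```

With the same total budget, the backoff schedule triggers 5 status checks instead of 16, at the cost of noticing a late-stopping bot up to 2 seconds after it has actually exited.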

@sebix
Member

sebix commented Apr 10, 2025

> Working more with intelmq these days, I noticed that when executing intelmqctl stop, some bots are sometimes still reported as running in the output afterwards (not that big of an issue) and the exit code is != 0 (bigger issue, since my wrapper script (using systemd for restarting and, most importantly, for starting when booting the server) reacts to this).

Not sure if I understand this correctly. If the operation succeeded, the exit code should be 0 and if it didn't (as in the case you describe), anything unequal to 0 is correct. What is the problem you are experiencing with the exit code?

> With this in mind, does my analysis make sense to you (as people knowing intelmq much better than I do)?

Yes. It's correct and the behaviour you see (bots not exiting in time and thus causing the "confusion") is not particularly new either, but was never critical enough to address it.

The 0.75 s delay was a value that worked reasonably well and was small enough not to take too long.

A proper solution would be what @kamil-certat said and the effort is approximately equal to a configurable delay.

@Lukas-Heindl
Author

> Not sure if I understand this correctly. If the operation succeeded, the exit code should be 0 and if it didn't (as in the case you describe), anything unequal to 0 is correct. What is the problem you are experiencing with the exit code?

What I meant is that the stop operation actually worked (the bots are all stopped after all) but the exit code indicates some sort of error.

> Yes. It's correct and the behaviour you see (bots not exiting in time and thus causing the "confusion") is not particularly new either, but was never critical enough to address it.

Alright, I didn't find an issue for it (but searching for stop gives many results, so I might have missed it).
Well, for us it is not critical either (as in: intelmq stops working), but it interferes with our monitoring of the server and generates false positives indicating that intelmq has stopped working.

I see this is not high on your list of priorities so you might not bother implementing this. Because of the other feature I need to set up some place where I can develop anyhow, so I'd just implement this as well (if you're willing to merge something like this).

Just to be clear: you are more on the side of a simple for try in range(x) ; do <check> -> break ; sleep (y) ; done (so no exponentially increasing sleep time or the like)? Also, would you bother exposing the parameters x (and maybe y) to the outside?

@sebix added the "good first issue" label Apr 11, 2025
@sebix
Member

sebix commented Apr 11, 2025

> Alright, didn't find an issue for it

I guess there is none.

> Because of the other feature I need to set up some place where I can develop anyhow, so I'd just implement this as well (if you're willing to merge something like this).

That would be greatly appreciated.

> Just to be clear: you are more on the side of a simple for try in range(x) ; do <check> -> break ; sleep (y) ; done (so no exponentially increasing sleep time or the like)? Also, would you bother exposing the parameters x (and maybe y) to the outside?

I guess increments of 0.1s and a maximum waiting time of 5s (or equivalent: maximum steps) would be sensible defaults.
Making them configurable (maybe in the global namespace as intelmqctl_stop_wait_time etc?) could be done too, but I'm not sure if it's necessary and worth the effort.

You get bonus points if the loop iterations only check the status of the not-yet stopped bots instead of checking all the bots in every iteration =) (causes fewer delays)
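
A minimal sketch of such a loop, using the suggested defaults of 0.1 s steps and a 5 s maximum and re-checking only the bots that have not stopped yet. The name `wait_for_bots_to_stop`, its parameters, and the fake status function are assumptions for illustration; this is not the code from the linked pull request.

```python
import time


def wait_for_bots_to_stop(bots, get_status, step=0.1, max_wait=5.0, sleep=time.sleep):
    """Poll until every bot is 'stopped' or 'disabled', or max_wait is used up.

    Returns (retval, botnet_status); retval is 0 only if no bot is left running.
    """
    botnet_status = {}
    outstanding = list(bots)
    max_steps = round(max_wait / step)
    for attempt in range(max_steps + 1):
        still_running = []
        for bot_id in outstanding:  # bots known to be stopped are skipped
            botnet_status[bot_id] = get_status(bot_id)
            if botnet_status[bot_id] not in ('stopped', 'disabled'):
                still_running.append(bot_id)
        outstanding = still_running
        if not outstanding or attempt == max_steps:
            break
        sleep(step)
    return (1 if outstanding else 0), botnet_status


# Hypothetical bots: each reports 'stopped' from its n-th status query onwards.
polls_needed = {'fast-bot': 1, 'slow-bot': 4}
polls_seen = {bot_id: 0 for bot_id in polls_needed}


def fake_status(bot_id):
    polls_seen[bot_id] += 1
    return 'stopped' if polls_seen[bot_id] >= polls_needed[bot_id] else 'running'


retval, status = wait_for_bots_to_stop(polls_needed, fake_status, sleep=lambda s: None)
print(retval, polls_seen)  # 0 {'fast-bot': 1, 'slow-bot': 4}
```

The fast bot is queried only once, while the slow bot is queried until it reports stopped; the loop returns as soon as nothing is outstanding, so the full 5 s is spent only when a bot really does not stop.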

Lukas-Heindl added a commit to Lukas-Heindl/intelmq that referenced this issue Apr 11, 2025
retry multiple times on `intelmqctl stop` to check if bots really
stopped, since the bots might take longer to stop.
Using retry in contrast to increasing the sleep_time keeps the delay
short in case the bots did already stop.
sebix pushed a commit to Lukas-Heindl/intelmq that referenced this issue Apr 28, 2025
retry multiple times on `intelmqctl stop` to check if bots really
stopped, since the bots might take longer to stop.
Using retry in contrast to increasing the sleep_time keeps the delay
short in case the bots did already stop.