
Conversation

lsiepel
Contributor

@lsiepel lsiepel commented Dec 24, 2024

No breaking changes.
Improved logging
Improved separation of concerns
Improved parsing of ping results (<1ms)
Improved resource use (memory and threads)
Discovery takes binding configuration into consideration

Should be backported to 5.0.x and 4.3.x.

Testable jar 5.1.0 snapshot (2025-09-05) or later: https://1drv.ms/u/c/70ea7ac46bc61c73/EegJadzMpp1IlQYIN5kMCfwBpyUEcQF8ReQZGRMTmTcllw?e=TQEpxt

@lsiepel lsiepel added the bug An unexpected problem or unintended behavior of an add-on label Dec 24, 2024
@lsiepel lsiepel requested review from a team and wborn December 24, 2024 14:11
@lolodomo
Contributor

@lsiepel : this PR is still not ready for review?

@lsiepel
Contributor Author

lsiepel commented Jul 12, 2025

@lsiepel : this PR is still not ready for review?

No, hard issue. I can't fully reproduce this issue, so the fixes I propose are merely theoretical.

@wborn wborn requested a review from Copilot August 3, 2025 06:03

@lsiepel lsiepel requested a review from Copilot August 29, 2025 07:47

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

@Nadahar I really appreciate your input here, and I need some more.

Previously I moved the scheduled task creation from the presence detection to the handler, a better place for lifecycle management. I also created a dedicated thread pool, similar to what you did for Chromecast. The thread pool made it worse according to #17956 (comment)

The discovery has seen some good improvements, but the startup is still problematic. I can't figure out what is actually blocking it so much.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

I'll try to help if I can, but I need to spend the time required to study "the whole threading structure" before I can really suggest if and how it could be improved.

I discovered that I have posted some posts in #17956 that I intended to post here, so I'll post them here now and then delete them from the issue, as they are really way too technical to be in the issue.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

Exactly the same as i discovered, see https://github.com/openhab/openhab-addons/pull/17972/files#diff-650862c0ba8b30dccfb5784d3dcacc002ea13fb21a9acb5ec2d4dbc32f95af7f (line 221)

👍

Note: I think you can use shutdownNow() from the beginning, since you want to stop the scan. Using just shutdown() won't attempt to interrupt tasks that are already running, and I don't think it will even remove those still queued; it just won't accept new tasks. shutdownNow() will at least try to interrupt those already running, although most network operations aren't interruptible. But it will drain the queue if some tasks are still queued, which shutdown() doesn't do.

You don't really have a "hard kill" option (or, you do with Thread.stop(), but that will cause a mess), so they will most likely just finish what they are doing anyway. For interrupts to work, the tasks themselves must be written to handle interrupts properly (that is, by exiting) and might also need to check if interrupted before starting long-running operations.
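The difference can be sketched in a few lines. This is only an illustration (the ShutdownDemo class and the sleep-based tasks are hypothetical stand-ins, not code from the binding): with one worker thread and a latch to make the timing deterministic, shutdownNow() interrupts the running task and returns the four still-queued ones, while shutdown() would have let all five run to completion.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        // First task: signals that it is running, then blocks like a slow scan task.
        pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit promptly when interrupted
            }
        });
        // Four more tasks that sit in the queue behind the first one.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> { });
        }
        started.await();
        // shutdown() would merely stop accepting new tasks: the queued four would
        // still run, and the sleeping one would not be interrupted.
        // shutdownNow() interrupts the running task AND drains the queue.
        List<Runnable> neverStarted = pool.shutdownNow();
        System.out.println("drained=" + neverStarted.size());
        System.out.println("terminated=" + pool.awaitTermination(5, TimeUnit.SECONDS));
    }
}
```

Because the sleeping task restores its interrupt flag and returns, the pool terminates almost immediately instead of after the full 10-second sleep.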

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

I found another potential issue. Again, I have no idea if this actually causes any of the current problems, but: NetworkUtils.nativePing() might deadlock.

GitHub won't let me comment in that part of the code, since no changes have been done there, but, it uses ProcessBuilder:

                proc = new ProcessBuilder("ping", "-w", String.valueOf(timeout.toSeconds()), "-c", "1", hostname)
                        .start();

A Process in Java generates two outputs (in the form of InputStreams) that must be drained as the process runs, or the process might "hang" waiting for the pipe to accept its output. These must normally be drained on separate threads to make sure nothing prevents them from being consumed. However, if you don't care about which output is "error" and which is "normal", you can merge them into one by doing this:

                proc = new ProcessBuilder("ping", "-w", String.valueOf(timeout.toSeconds()), "-c", "1", hostname)
                        .redirectErrorStream(true).start();

Once that is done, it's enough to drain this "merged" stream from Process.getInputStream(), but this isn't done here, which means that if any stderr is produced, it will never be drained.

These pipes/streams might have some buffering, but that is OS/implementation dependent and can't be expected to "save you" (although that's the only reason this works at all).

Later down in the code:

        int result = proc.waitFor();
        if (result != 0) {
            return new PingResult(false, Duration.between(execStartTime, Instant.now()));
        }

        try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {

proc.waitFor() is called before the draining of (one of) the pipe(s) starts, which will hang if the buffer can't handle the output. You're supposed to read the pipes/streams while the process is running. Even worse, if result != 0, the streams/pipes won't be drained at all.

Further down, the reading also might exit before being finished, which is also wrong.

IMO, stdout and stderr should be merged, the resulting stream read/drained into a collection of strings (probably one string per line) and then you could potentially wait for the process to finish (but I don't think the stream will end until the process has finished).

Once this is done, results can be parsed and the appropriate action taken.
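Put together, the approach described above looks roughly like this. This is a sketch, not the binding's code: the DrainDemo class is hypothetical, and a trivial `sh -c` command stands in for the real `ping` invocation so the example is self-contained. The key points are that stderr is merged into stdout, the merged stream is drained to EOF while the process runs, and waitFor() is only called afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class DrainDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // "sh -c" stands in for the real "ping" command; it writes to both
        // stdout and stderr so the merged stream carries both.
        Process proc = new ProcessBuilder("sh", "-c", "echo reply; echo warning 1>&2")
                .redirectErrorStream(true) // merge stderr into stdout: one pipe to drain
                .start();

        // Drain the merged stream WHILE the process runs; readLine() returns
        // null at EOF, when the process has closed its side of the pipe.
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }

        // Only wait for the exit code AFTER the pipe has been fully consumed,
        // so the process can never block on a full pipe.
        int exit = proc.waitFor();
        System.out.println("exit=" + exit + " lines=" + lines.size());
    }
}
```

With the full output collected first, the exit code and the collected lines can then be parsed regardless of whether the command succeeded.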

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

I'd also say that this is somewhat questionable:

    public void wakeUpIOS(InetAddress address) throws IOException {
        int port = 5353;
        try (DatagramSocket s = new DatagramSocket()) {
            // Send a valid mDNS packet (12 bytes of zeroes)
            byte[] buffer = new byte[12];
            s.send(new DatagramPacket(buffer, buffer.length, address, port));
            logger.trace("Sent packet to {}:{} to wake up iOS device", address, port);
        } catch (PortUnreachableException e) {
            logger.trace("Unable to send packet to wake up iOS device at {}:{}", address, port, e);
        }
    }

I get that it's a "cheat" to wake up sleeping iOS devices, but it's also an mDNS packet that could cause trouble. OH itself, and many other devices, use mDNS, and since it's not known in advance which device is iOS, I assume this is sent to all. This will interfere with the genuine mDNS communication OH does, and I could also imagine that this could be related to the "mDNS storms" that I've read OH can cause.

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

I think you can use shutdownNow() from the beginning, you want to stop the scan.

An easy change to test, but I can't yet see how this would cause the experienced problems.

NetworkUtils.nativePing() might deadlock.

Pff, I need to inspect how this works :-) Give me some time to understand.

I'd also say that this is somewhat questionable

For the discovery process this might be true; for regular things that initialize, the wakeUpIOS() call is very specific to that device, so this should not have any effect at all on mDNS. For discovery I can do a test to see the impact.

From #17956 (comment) it looks like the initialization of each device takes a sort of fixed time. With parallelism I would expect bursts of initializations.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

An easy change to test, but i can't see yet how this would cause the experienced problems.

No, I don't have enough overview (yet) to really have an opinion about what causes the delays, so I've just written down things as I've seen them while studying the code.

For the discovery process this might be true; for regular things that initialize, the wakeUpIOS() call is very specific to that device, so this should not have any effect at all on mDNS. For discovery I can do a test to see the impact.

Yes, I was thinking of the discovery scan. It's hardcoded to true there, which means a lot of such packets will be sent, which in turn might trigger responses from other devices. I don't know enough about mDNS to figure out what an all-zero packet might do/cause, but it seems "a very crude thing to do". When you have a device that you have configured as an iOS device and you wish for this to happen, I see no problem. That's only one packet anyway, and if iOS devices expect these, they probably won't "respond negatively" to them.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

Pff, I need to inspect how this works :-) Give me some time to understand.

Take your time, but FYI when you get there: a Process creates three OS-level pipes (I think stdin is a pipe as well, but as it's not problematic we can ignore it), two of which are stdout and stderr. When you run this from a console/terminal, these will normally be printed in the console, unless you have redirected them somewhere else. The problem is that, although there will often be some buffering somewhere (either in the JVM, the OS, or in the program producing the output), there's no guarantee that any buffering exists in this pipeline (and I've absolutely experienced this, so it's not just hypothetical; I made a "test" to cause this to happen for the Exec binding).

If the running program/process tries to write to one of these pipes and there is no buffering, or the buffer is full, its "write command" will block until there is space in the pipe. This is a normal OS "blocking I/O" operation, and it applies to any process that tries to write when there's no space to do so. Thus, the Java program that "handles" this process must make sure to always consume whatever is written to these pipes while the process is running, or it could block forever/deadlock. That is not what is done here; instead, stdout might be partially or fully consumed after the process is finished, or it might not be consumed at all. stderr is completely ignored. One can only imagine all the problems this might cause: processes that "never finish", pipes that are never fully consumed...
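When the two outputs must stay separate (rather than being merged with redirectErrorStream), the usual pattern is one consumer thread per pipe. A minimal sketch, assuming nothing about the binding's code (the ConsumerDemo class and the `sh -c` command are hypothetical stand-ins): both pipes are drained concurrently while the process runs, so neither write can ever block the child process.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConsumerDemo {
    // Start a daemon thread that drains one pipe into a list of lines.
    static Thread consume(InputStream in, List<String> sink) {
        Thread t = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = r.readLine()) != null) {
                    sink.add(line);
                }
            } catch (IOException ignored) {
                // Pipe closed; nothing more to drain.
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // A trivial command that writes one line to each pipe.
        Process proc = new ProcessBuilder("sh", "-c", "echo out; echo err 1>&2").start();

        List<String> out = Collections.synchronizedList(new ArrayList<>());
        List<String> err = Collections.synchronizedList(new ArrayList<>());
        // Both pipes are drained concurrently while the process runs.
        Thread outT = consume(proc.getInputStream(), out);
        Thread errT = consume(proc.getErrorStream(), err);

        int exit = proc.waitFor(); // safe: the pipes are being drained elsewhere
        outT.join();               // wait until both streams have hit EOF
        errT.join();
        System.out.println("exit=" + exit + " out=" + out + " err=" + err);
    }
}
```

Joining the consumer threads after waitFor() guarantees that all output has been collected before it is parsed.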

After a quick look, it seems like the "arpping" command handles Process in the same way.

This isn't necessarily difficult to fix, but testing that it works properly under all circumstances can be daunting. That said, we can already "know" that the current code won't work properly under all circumstances, so it might still be worth making something more likely to work even if we're not able to test it under enough variety of circumstances.

I could consider making a PR against your PR for this part, so that you don't need to think about the process handling on top of the threading stuff.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

@lsiepel I've checked out your branch and am trying to do some debugging. But, I'm having a lot of trouble because this branch is on 5.0 while my core repo is checked out at latest/5.1.

I could easily rebase the branch to 5.1, but that would make it "impossible" to create PRs from my branch into yours. I could check out 5.0 core, but I'd really prefer not to, because Eclipse is going to spend hours rebuilding everything.

Is there a reason why this is still at 5.0, or could you just rebase it on latest main to make things easier?

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

I can rebase, I just have some commits in my main I need to preserve. First dinner and kids to bed 😂

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

I can rebase, I just have some commits in my main I need to preserve. First dinner and kids to bed 😂

Done (I first did a merge commit, but removed it and did a rebase instead)

@lsiepel lsiepel force-pushed the network-performance branch from 2c4962d to 0e7e4cd Compare August 30, 2025 18:59
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@lsiepel lsiepel marked this pull request as ready for review September 6, 2025 06:02
@lolodomo
Contributor

lolodomo commented Sep 6, 2025

@lsiepel : please update (regenerate) the i18n properties file network.properties.

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Contributor

@lolodomo lolodomo left a comment

Review part 1 of 2

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@lsiepel lsiepel force-pushed the network-performance branch from bcad22e to e36ff67 Compare September 6, 2025 08:54
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@Nadahar
Contributor

Nadahar commented Sep 6, 2025

@lsiepel Since you added mocking of the admin service, you can remove the nullness annotation on the admin service reference in the constructor, and the corresponding null handling following that in the code.

edit: Oh, I see, you already did this in reverse order 😉 There's a reason why I left that @Nullable: it was because of the tests.

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Contributor

@lolodomo lolodomo left a comment

Last part of the review

Comment on lines +611 to +613
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return;
Contributor

@lolodomo lolodomo Sep 7, 2025

Why catch the interruption exception if this is just to retrigger it?

Same comment for other places in the same file.

Contributor Author

@lsiepel lsiepel Sep 7, 2025

It does not retrigger the exception, as in it does not re-throw the exception. The exception is caught, the thread's interrupted status is restored, and the method returns immediately to abort fast.
This is a best practice for graceful termination. What else would you recommend?

Contributor

You're right, my sentence was not precise enough: the exception is not retriggered by your code, but your code is requesting the thread to interrupt again, so this will lead to a new InterruptedException a moment later in the code.
That looks a little strange to me, but if this is the best practice, that is fine for me.

Contributor

Generally, InterruptedException isn't well understood by many. It will only ever be thrown if the thread is interrupted, and that usually happens only when an executor is being shut down or the program terminates. The JVM will interrupt all (non-daemon, I think) threads as part of the shutdown process; it's how a graceful shutdown is supposed to work.

So, generally, when interrupted, the goal is that the code should wrap up whatever it is doing and exit as soon as possible. A thread won't stop running because it is interrupted; "normal code" will keep running (it is possible for code to check isInterrupted() manually before starting long/heavy operations, but I very rarely see this done). But most blocking operations use InterruptedException as really the only way they can be "aborted".

When you catch an InterruptedException, the interrupted flag of the thread is cleared. I guess those that designed it thought it was a good idea because it would give the flexibility to use interrupts "for program flow", but I bet they didn't anticipate how few would actually familiarize themselves with how this works. Because, if the goal really is to terminate what the task is doing, like here, you need the flag to "stay on" until the task has exited completely. Otherwise, it will just interrupt what it's doing right there and then return to the calling code, which might contain other blocking operations that would then block and prevent the task from exiting.

This is why it's "recommended" to restore the interrupted flag when you catch InterruptedException: when you do that, you make sure that the code will again be interrupted if it tries to sleep/wait for something.
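The pattern described above can be shown in miniature. This is a hypothetical sketch (the InterruptDemo class and pollOnce() method are illustrations, not the binding's code): the catch block restores the flag before returning, so the enclosing code can observe the interrupt and skip any further blocking work.

```java
public class InterruptDemo {
    // A task step that restores the interrupt flag so callers also see it and
    // later blocking calls fail fast instead of blocking again.
    static void pollOnce() {
        try {
            Thread.sleep(50); // stands in for a blocking network operation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the cleared flag
            // wrap up and exit quickly
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            pollOnce();
            // Because the flag was restored, the enclosing code can notice the
            // interrupt and avoid entering another blocking operation.
            if (Thread.currentThread().isInterrupted()) {
                System.out.println("interrupted, exiting task");
            }
        });
        worker.start();
        worker.interrupt(); // simulate an executor shutdown interrupting the task
        worker.join();
    }
}
```

Without the Thread.currentThread().interrupt() call in the catch block, isInterrupted() would return false and the task would carry on as if nothing happened.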

@lolodomo
Contributor

lolodomo commented Sep 7, 2025

@Nadahar : is it ok for you ?

@Nadahar
Contributor

Nadahar commented Sep 7, 2025

@Nadahar : is it ok for you ?

Yes, as far as I know, this is good to go.

Contributor

@lolodomo lolodomo left a comment

LGTM

@lolodomo lolodomo merged commit a5fcfe4 into openhab:main Sep 7, 2025
2 checks passed
@lolodomo lolodomo added this to the 5.1 milestone Sep 7, 2025
@lsiepel lsiepel deleted the network-performance branch September 7, 2025 19:55
@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/jmdns-memory-leak/166318/4

@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/openhab-5-1-milestone-discussion/166385/36

Nadahar pushed a commit to Nadahar/openhab-addons that referenced this pull request Oct 3, 2025
…nhab#17972)

* Fix performance
* Improvements
* Add logging
* Improve logging
* Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java
* Improve thread handling
* Improve shutdown
* thread cleanup
* Improve per thread allocation
* Stop on finishing all interfaces
* Change Arping to make use of completeablefeature
* Seperate presence detection from lifecycle
* Improve logging and filtering
* Create seperate ScheduledExecutorService
* Fix review comment
* Improve network address checks
* Improve interrupthandling
* Revert threadlocal
* Refactor Presencedetection
* Make LatencyParser accept Windows' <1ms syntax for minimal latency
* Remove misleading reference to non-existing NetworkHandlerBuilder
* Handle thread-safety of NetworkDiscoveryService.executorService
* Fix network interface exclusion
* Make PresenceDetectionValue thread-safe
* Partial refactoring of PresenceDetection

Fixes:
  - Synchronization of lastSeen
  - Joining of async tasks
  - Minor logging improvements
Addition:
  - Create setIcmpPingMethod()

* Partial refactoring of NetworkDiscoveryService

Fixes:
  - Correct the number of addresses in a /24 subnet.
  - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly.
  - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down.

* Make NetworkHandler thread-safe
* Make SpeedTestHandler thread-safe
* Make sure that process output is consumed
* Implement tool-specific latency parsing
* Fix NetworkDiscoveryService configuration and make the thread pool size configurable
* i18n
* Fix test

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Co-authored-by: Wouter Born <github@maindrain.net>
Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
Nadahar pushed a commit to Nadahar/openhab-addons that referenced this pull request Oct 3, 2025
DrRSatzteil pushed a commit to DrRSatzteil/openhab-addons that referenced this pull request Oct 11, 2025
lsiepel added a commit that referenced this pull request Oct 11, 2025
* [network] Fix discovery performance causing a slow openHAB start (#17972)


* [network] Use a threaded consumer for Process output consumption (#19398)

* Use a threaded consumer for Process output consumption

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>

---------

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Co-authored-by: lsiepel <leosiepel@gmail.com>
Co-authored-by: Wouter Born <github@maindrain.net>
Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
@lsiepel lsiepel added backported A PR that has been cherry-picked to a patch release branch backported4 Backported patches to the 4.x branch. Used for constructing release notes. labels Oct 11, 2025
