
Conversation

lsiepel
Contributor

@lsiepel lsiepel commented Dec 24, 2024

No breaking changes.
Improved logging
Improved separation of concerns
Improved parsing of ping results (<1ms)
Improved resource use (memory and threads)
Discovery takes binding configuration into consideration

Should be backported to 5.0.x and 4.3.x.

Testable jar 5.1.0 snapshot (2025-09-05) or later: https://1drv.ms/u/c/70ea7ac46bc61c73/EegJadzMpp1IlQYIN5kMCfwBpyUEcQF8ReQZGRMTmTcllw?e=TQEpxt

@lsiepel lsiepel added the bug An unexpected problem or unintended behavior of an add-on label Dec 24, 2024
@lsiepel lsiepel requested review from a team and wborn December 24, 2024 14:11
@lolodomo
Contributor

@lsiepel : this PR is still not ready for review?

@lsiepel
Contributor Author

lsiepel commented Jul 12, 2025

@lsiepel : this PR is still not ready for review?

No, hard issue. I can't fully reproduce this issue, so the fixes I propose are merely theoretical.

@wborn wborn requested a review from Copilot August 3, 2025 06:03

@lsiepel lsiepel requested a review from Copilot August 29, 2025 07:47

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

@Nadahar I really appreciate your input here, and I need some more.

Previously I moved the scheduled task creation from the presence detection to the handler, a better place for lifecycle management. I also created a dedicated thread pool, similar to what you did for Chromecast. The thread pool made it worse according to #17956 (comment)

The discovery has seen some good improvements, but the startup is still problematic. I can't figure out what is actually blocking it so much.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

I'll try to help if I can, but I need to spend the time required to study "the whole threading structure" before I can really suggest if and how it could be improved.

I discovered that I have posted some posts in #17956 that I intended to post here, so I'll post them here now and then delete them from the issue, as they are really way too technical to be in the issue.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

Exactly the same as i discovered, see https://github.com/openhab/openhab-addons/pull/17972/files#diff-650862c0ba8b30dccfb5784d3dcacc002ea13fb21a9acb5ec2d4dbc32f95af7f (line 221)

👍

Note: I think you can use shutdownNow() from the beginning, since you want to stop the scan. Using just shutdown() won't attempt to interrupt tasks that are already running, and I don't think it will even remove those still queued; it just won't accept new tasks. shutdownNow() will at least try to interrupt those already running, although most network operations aren't interruptible. But it will drain the queue if some tasks are still queued, which shutdown() doesn't do.

You don't really have a "hard kill" option (or, you do with Thread.stop(), but that will cause a mess), so they will most likely just finish what they are doing anyway. For interrupts to work, the tasks themselves must be written to handle interrupts properly (that is, by exiting) and might also need to check if interrupted before starting long-running operations.
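The difference can be sketched in a few lines. This is only an illustration (the ShutdownDemo class and the sleep-based tasks are hypothetical stand-ins, not code from the binding): with one worker thread and a latch to make the timing deterministic, shutdownNow() interrupts the running task and returns the four still-queued ones, while shutdown() would have let all five run to completion.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        // First task: signals that it is running, then blocks like a slow scan task.
        pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit promptly when interrupted
            }
        });
        // Four more tasks that sit in the queue behind the first one.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> { });
        }
        started.await();
        // shutdown() would merely stop accepting new tasks: the queued four would
        // still run, and the sleeping one would not be interrupted.
        // shutdownNow() interrupts the running task AND drains the queue.
        List<Runnable> neverStarted = pool.shutdownNow();
        System.out.println("drained=" + neverStarted.size());
        System.out.println("terminated=" + pool.awaitTermination(5, TimeUnit.SECONDS));
    }
}
```

Because the sleeping task restores its interrupt flag and returns, the pool terminates almost immediately instead of after the full 10-second sleep.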

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

I found another potential issue. Again, I have no idea if this actually causes any of the current problems, but: NetworkUtils.nativePing() might deadlock.

GitHub won't let me comment in that part of the code, since no changes have been done there, but, it uses ProcessBuilder:

                proc = new ProcessBuilder("ping", "-w", String.valueOf(timeout.toSeconds()), "-c", "1", hostname)
                        .start();

A Process in Java generates two outputs (in the form of InputStreams) that must be drained as the process runs, or the process might "hang" waiting for the pipe to accept its output. These must normally be drained on separate threads to make sure nothing prevents them from being consumed. However, if you don't care about which output is "error" and which is "normal", you can merge them into one by doing this:

                proc = new ProcessBuilder("ping", "-w", String.valueOf(timeout.toSeconds()), "-c", "1", hostname)
                        .redirectErrorStream(true).start();

Once that is done, it's enough to drain this "merged" stream from Process.getInputStream(), but this isn't done here, which means that if any stderr is produced, it will never be drained.

These pipes/streams might have some buffering, but that is OS/implementation dependent and can't be expected to "save you" (although that's the only reason this works at all).

Later down in the code:

        int result = proc.waitFor();
        if (result != 0) {
            return new PingResult(false, Duration.between(execStartTime, Instant.now()));
        }

        try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {

proc.waitFor() is called before the draining of (one of) the pipe(s) starts, which will hang if the buffer can't handle the output. You're supposed to read the pipes/streams while the process is running. Even worse, if result != 0, the streams/pipes won't be drained at all.

Further down, the reading also might exit before being finished, which is also wrong.

IMO, stdout and stderr should be merged, the resulting stream read/drained into a collection of strings (probably one string per line) and then you could potentially wait for the process to finish (but I don't think the stream will end until the process has finished).

Once this is done, results can be parsed and the appropriate action taken.
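Put together, the approach described above looks roughly like this. This is a sketch, not the binding's code: the DrainDemo class is hypothetical, and a trivial `sh -c` command stands in for the real `ping` invocation so the example is self-contained. The key points are that stderr is merged into stdout, the merged stream is drained to EOF while the process runs, and waitFor() is only called afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class DrainDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // "sh -c" stands in for the real "ping" command; it writes to both
        // stdout and stderr so the merged stream carries both.
        Process proc = new ProcessBuilder("sh", "-c", "echo reply; echo warning 1>&2")
                .redirectErrorStream(true) // merge stderr into stdout: one pipe to drain
                .start();

        // Drain the merged stream WHILE the process runs; readLine() returns
        // null at EOF, when the process has closed its side of the pipe.
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }

        // Only wait for the exit code AFTER the pipe has been fully consumed,
        // so the process can never block on a full pipe.
        int exit = proc.waitFor();
        System.out.println("exit=" + exit + " lines=" + lines.size());
    }
}
```

With the full output collected first, the exit code and the collected lines can then be parsed regardless of whether the command succeeded.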

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

I'd also say that this is somewhat questionable:

    public void wakeUpIOS(InetAddress address) throws IOException {
        int port = 5353;
        try (DatagramSocket s = new DatagramSocket()) {
            // Send a valid mDNS packet (12 bytes of zeroes)
            byte[] buffer = new byte[12];
            s.send(new DatagramPacket(buffer, buffer.length, address, port));
            logger.trace("Sent packet to {}:{} to wake up iOS device", address, port);
        } catch (PortUnreachableException e) {
            logger.trace("Unable to send packet to wake up iOS device at {}:{}", address, port, e);
        }
    }

I get that it's a "cheat" to wake up sleeping iOS devices, but it's also an mDNS packet that could cause trouble. OH itself, and many other devices, use mDNS, and since it's not known in advance which device is iOS, I assume this is sent to all. This will interfere with the genuine mDNS communication OH does, and I could also imagine that this could be related to the "mDNS storms" that I've read OH can cause.

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

I think you can use shutdownNow() from the beginning, you want to stop the scan.

An easy change to test, but I can't yet see how this would cause the experienced problems.

NetworkUtils.nativePing() might deadlock.

Pff, I need to inspect how this works :-) Give me some time to understand.

I'd also say that this is somewhat questionable

For the discovery process this might be true; for regular things that initialize, the wakeUpIOS() call is very specific to that device, so this should not have any effect at all on mDNS. For discovery I can do a test to see the impact.

From #17956 (comment) it looks like the initialization of each device takes a sort of fixed time. With parallelism I would expect bursts of initializations.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

An easy change to test, but i can't see yet how this would cause the experienced problems.

No, I don't have enough overview (yet) to really have an opinion about what causes the delays, so I've just written down things as I've seen them while studying the code.

For the discovery process this might be true; for regular things that initialize, the wakeUpIOS() call is very specific to that device, so this should not have any effect at all on mDNS. For discovery I can do a test to see the impact.

Yes, I was thinking of the discovery scan. It's hardcoded to true there, which means a lot of such packets will be sent, which in turn might trigger responses from other devices. I don't know enough about mDNS to figure out what an all-zero packet might do/cause, but it seems "a very crude thing to do". When you have a device that you have configured as an iOS device and you wish for this to happen, I see no problem. That's only one packet anyway, and if iOS devices expect these, they probably won't "respond negatively" to them.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

Pff, I need to inspect how this works :-) Give me some time to understand.

Take your time, but FYI when you get there: a Process creates three OS-level pipes (I think stdin is a pipe as well, but as it's not problematic we can ignore it), two of which are stdout and stderr. When you run this from a console/terminal, these will normally be printed in the console, unless you have redirected them somewhere else. The problem is that, although there will often be some buffering somewhere (either in the JVM, the OS, or in the program producing the output), there's no guarantee that any buffering exists in this pipeline (and I've absolutely experienced this, so it's not just hypothetical; I made a "test" to cause this to happen for the Exec binding).

If the running program/process tries to write to one of these pipes and there is no buffering, or the buffer is full, its "write command" will block until there is space in the pipe. This is a normal OS "blocking I/O" operation, and it applies to any process that tries to write when there's no space to do so. Thus, the Java program that "handles" this process must make sure to always consume whatever is written to these pipes while the process is running, or it could block forever/deadlock. That is not what is done here; instead, stdout might be partially or fully consumed after the process is finished, or it might not be consumed at all. stderr is completely ignored. One can only imagine all the problems this might cause: processes that "never finish", pipes that are never fully consumed...
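When the two outputs must stay separate (rather than being merged with redirectErrorStream), the usual pattern is one consumer thread per pipe. A minimal sketch, assuming nothing about the binding's code (the ConsumerDemo class and the `sh -c` command are hypothetical stand-ins): both pipes are drained concurrently while the process runs, so neither write can ever block the child process.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConsumerDemo {
    // Start a daemon thread that drains one pipe into a list of lines.
    static Thread consume(InputStream in, List<String> sink) {
        Thread t = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = r.readLine()) != null) {
                    sink.add(line);
                }
            } catch (IOException ignored) {
                // Pipe closed; nothing more to drain.
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // A trivial command that writes one line to each pipe.
        Process proc = new ProcessBuilder("sh", "-c", "echo out; echo err 1>&2").start();

        List<String> out = Collections.synchronizedList(new ArrayList<>());
        List<String> err = Collections.synchronizedList(new ArrayList<>());
        // Both pipes are drained concurrently while the process runs.
        Thread outT = consume(proc.getInputStream(), out);
        Thread errT = consume(proc.getErrorStream(), err);

        int exit = proc.waitFor(); // safe: the pipes are being drained elsewhere
        outT.join();               // wait until both streams have hit EOF
        errT.join();
        System.out.println("exit=" + exit + " out=" + out + " err=" + err);
    }
}
```

Joining the consumer threads after waitFor() guarantees that all output has been collected before it is parsed.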

After a quick look, it seems like the "arpping" command handles Process in the same way.

This isn't necessarily difficult to fix, but testing that it works properly under all circumstances can be daunting. That said, we can already "know" that the current code won't work properly under all circumstances, so it might still be worth making something more likely to work even if we're not able to test it under enough variety of circumstances.

I could consider making a PR against your PR for this part, so that you don't need to think about the process handling on top of the threading stuff.

@Nadahar
Contributor

Nadahar commented Aug 30, 2025

@lsiepel I've checked out your branch and am trying to do some debugging. But, I'm having a lot of trouble because this branch is on 5.0 while my core repo is checked out at latest/5.1.

I could easily rebase the branch to 5.1, but that would make it "impossible" to create PRs from my branch into yours. I could check out 5.0 core, but I'd really prefer not to, because Eclipse is going to spend hours rebuilding everything.

Is there a reason why this is still at 5.0, or could you just rebase it on latest main to make things easier?

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

I can rebase, I just have some commits in my main I need to preserve. First dinner and kids to bed 😂

@lsiepel
Contributor Author

lsiepel commented Aug 30, 2025

I can rebase, I just have some commits in my main I need to preserve. First dinner and kids to bed 😂

Done (I first did a merge commit, but removed it and did a rebase instead)

@lsiepel lsiepel force-pushed the network-performance branch from 2c4962d to 0e7e4cd Compare August 30, 2025 18:59
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@lsiepel lsiepel marked this pull request as ready for review September 6, 2025 06:02
@lolodomo
Contributor

lolodomo commented Sep 6, 2025

@lsiepel : please update (regenerate) the i18n properties file network.properties.

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Contributor

@lolodomo lolodomo left a comment

Review part 1 of 2

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@lsiepel lsiepel force-pushed the network-performance branch from bcad22e to e36ff67 Compare September 6, 2025 08:54
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@Nadahar
Contributor

Nadahar commented Sep 6, 2025

@lsiepel Since you added mocking of the admin service, you can remove the nullness annotation on the admin service reference in the constructor, and the corresponding null handling following that in the code.

edit: Oh, I see, you already did this in reverse order 😉 There's a reason why I left that @Nullable: it was because of the tests.

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Contributor

@lolodomo lolodomo left a comment

Last part of the review

Comment on lines +611 to +613
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return;
Contributor

@lolodomo lolodomo Sep 7, 2025

Why catch the interruption exception if this is just to retrigger it?

Same comment for other places in the same file.

Contributor Author

@lsiepel lsiepel Sep 7, 2025

It does not retrigger the exception, as in it does not re-throw the exception. The exception is caught, the thread's interrupted status is restored, and the method returns immediately to abort fast.
This is a best practice for graceful termination. What else would you recommend?

Contributor

You're right, my sentence was not precise enough: the exception is not retriggered by your code, but your code is requesting the thread to interrupt again, so this will lead to a new InterruptedException a moment later in the code.
That looks a little strange to me, but if this is the best practice, that is fine for me.

Contributor

Generally, InterruptedException isn't well understood by many. It will only ever be thrown if the thread is interrupted, and that usually happens only when an executor is being shut down or the program terminates. The JVM will interrupt all (non-daemon, I think) threads as part of the shutdown process; it's how a graceful shutdown is supposed to work.

So, generally, when interrupted, the goal is that the code should wrap up whatever it is doing and exit as soon as possible. A thread won't stop running because it is interrupted; "normal code" will keep running (it is possible for code to check isInterrupted() manually before starting long/heavy operations, but I very rarely see this done). But most blocking operations use InterruptedException as really the only way they can be "aborted".

When you catch an InterruptedException, the interrupted flag of the thread is cleared. I guess those that designed it thought it was a good idea because it would give the flexibility to use interrupts "for program flow", but I bet they didn't anticipate how few would actually familiarize themselves with how this works. Because, if the goal really is to terminate what the task is doing, like here, you need the flag to "stay on" until the task has exited completely. Otherwise, it will just interrupt what it's doing right there and then return to the calling code, which might contain other blocking operations that would then block and prevent the task from exiting.

This is why it's "recommended" to restore the interrupted flag when you catch InterruptedException: when you do that, you make sure that the code will again be interrupted if it tries to sleep/wait for something.
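The pattern described above can be shown in miniature. This is a hypothetical sketch (the InterruptDemo class and pollOnce() method are illustrations, not the binding's code): the catch block restores the flag before returning, so the enclosing code can observe the interrupt and skip any further blocking work.

```java
public class InterruptDemo {
    // A task step that restores the interrupt flag so callers also see it and
    // later blocking calls fail fast instead of blocking again.
    static void pollOnce() {
        try {
            Thread.sleep(50); // stands in for a blocking network operation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the cleared flag
            // wrap up and exit quickly
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            pollOnce();
            // Because the flag was restored, the enclosing code can notice the
            // interrupt and avoid entering another blocking operation.
            if (Thread.currentThread().isInterrupted()) {
                System.out.println("interrupted, exiting task");
            }
        });
        worker.start();
        worker.interrupt(); // simulate an executor shutdown interrupting the task
        worker.join();
    }
}
```

Without the Thread.currentThread().interrupt() call in the catch block, isInterrupted() would return false and the task would carry on as if nothing happened.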

@lolodomo
Contributor

lolodomo commented Sep 7, 2025

@Nadahar : is it ok for you ?

@Nadahar
Contributor

Nadahar commented Sep 7, 2025

@Nadahar : is it ok for you ?

Yes, as far as I know, this is good to go.

Contributor

@lolodomo lolodomo left a comment

LGTM

@lolodomo lolodomo merged commit a5fcfe4 into openhab:main Sep 7, 2025
2 checks passed
@lolodomo lolodomo added this to the 5.1 milestone Sep 7, 2025
@lsiepel lsiepel deleted the network-performance branch September 7, 2025 19:55
@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/jmdns-memory-leak/166318/4

@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/openhab-5-1-milestone-discussion/166385/36

Nadahar pushed a commit to Nadahar/openhab-addons that referenced this pull request Oct 3, 2025
…nhab#17972)

* Fix performance
* Improvements
* Add logging
* Improve logging
* Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java
* Improve thread handling
* Improve shutdown
* thread cleanup
* Improve per thread allocation
* Stop on finishing all interfaces
* Change Arping to make use of completeablefeature
* Seperate presence detection from lifecycle
* Improve logging and filtering
* Create seperate ScheduledExecutorService
* Fix review comment
* Improve network address checks
* Improve interrupthandling
* Revert threadlocal
* Refactor Presencedetection
* Make LatencyParser accept Windows' <1ms syntax for minimal latency
* Remove misleading reference to non-existing NetworkHandlerBuilder
* Handle thread-safety of NetworkDiscoveryService.executorService
* Fix network interface exclusion
* Make PresenceDetectionValue thread-safe
* Partial refactoring of PresenceDetection

Fixes:
  - Synchronization of lastSeen
  - Joining of async tasks
  - Minor logging improvements
Addition:
  - Create setIcmpPingMethod()

* Partial refactoring of NetworkDiscoveryService

Fixes:
  - Correct the number of addresses in a /24 subnet.
  - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly.
  - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down.

* Make NetworkHandler thread-safe
* Make SpeedTestHandler thread-safe
* Make sure that process output is consumed
* Implement tool-specific latency parsing
* Fix NetworkDiscoveryService configuration and make the thread pool size configurable
* i18n
* Fix test

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Co-authored-by: Wouter Born <github@maindrain.net>
Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
Nadahar pushed a commit to Nadahar/openhab-addons that referenced this pull request Oct 3, 2025
DrRSatzteil pushed a commit to DrRSatzteil/openhab-addons that referenced this pull request Oct 11, 2025
lsiepel added a commit that referenced this pull request Oct 11, 2025
* [network] Fix discovery performance causing a slow openHAB start (#17972)


* [network] Use a threaded consumer for Process output consumption (#19398)

* Use a threaded consumer for Process output consumption

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>

---------

Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Co-authored-by: lsiepel <leosiepel@gmail.com>
Co-authored-by: Wouter Born <github@maindrain.net>
Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
@lsiepel lsiepel added backported A PR that has been cherry-picked to a patch release branch backported4 Backported patches to the 4.x branch. Used for constructing release notes. labels Oct 11, 2025
