-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[network] Fix discovery performance causing a slow openHAB start #17972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...b.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java
Outdated
Show resolved
Hide resolved
@lsiepel : this PR is still not yet ready for review? |
No, hard issue. I can’t fully reproduce this issue, so the fixes I propose are merely theoretically. |
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Show resolved
Hide resolved
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Outdated
Show resolved
Hide resolved
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Outdated
Show resolved
Hide resolved
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Outdated
Show resolved
Hide resolved
@Nadahar i really appreciate your input here and i need some more. Previously i moved the scheduled task creation from the presencedetection to the handler. A better place for lifecycle management. I also created a dedicated threadpool similar to what you did for chromecast. The threadpool made it worse according to #17956 (comment) The discovery has made some good improvements, but the startup is still problematic. I can't figure out what is actually blocking this so much. |
I'll try to help if I can, but I need to spend the time required to study "the whole treading structure" before I can really suggest if and how it could be improved. I discovered that I have posted some posts in #17956 that I intended to post here, so I'll post them here now and then delete them from the issue, as they are really way too technical to be in the issue. |
👍 Note: I think you can use You don't really have a "hard kill" option (or, you do with |
I found another potential issue. Again, I have no idea if this actually causes any of the current problems, but: GitHub won't let me comment in that part of the code, since no changes have been done there, but, it uses proc = new ProcessBuilder("ping", "-w", String.valueOf(timeout.toSeconds()), "-c", "1", hostname)
.start(); A proc = new ProcessBuilder("ping", "-w", String.valueOf(timeout.toSeconds()), "-c", "1", hostname)
.redirectErrorStream(true).start(); Once that is done, it's enough to drain this "merged" stream from These pipes/streams might have some buffering, but that is OS/implementation dependent and can't be expected to "save you" (although that's the only reason this works at all). Later down in the code: int result = proc.waitFor();
if (result != 0) {
return new PingResult(false, Duration.between(execStartTime, Instant.now()));
}
try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
Further down, the reading also might exit before being finished, which is also wrong. IMO, Once this is done, results can be parsed and the appropriate action taken. |
I'd also say that this is somewhat questionable: public void wakeUpIOS(InetAddress address) throws IOException {
int port = 5353;
try (DatagramSocket s = new DatagramSocket()) {
// Send a valid mDNS packet (12 bytes of zeroes)
byte[] buffer = new byte[12];
s.send(new DatagramPacket(buffer, buffer.length, address, port));
logger.trace("Sent packet to {}:{} to wake up iOS device", address, port);
} catch (PortUnreachableException e) {
logger.trace("Unable to send packet to wake up iOS device at {}:{}", address, port, e);
}
} I get that it's a "cheat" to wake up sleeping iOS devices, but it's also a mDNS packet that could cause trouble. OH itself, and many other devices, use mDNS, and since it's not known in advance what device is iOS, I assume this is sent to all. This will interfere with the genuine mDNS communication OH does, and I could also imagine that this could be related to "mDNS storms" that I've read that OH can cause. |
An easy change to test, but i can't see yet how this would cause the experienced problems.
pff, i need to inspect how this works :-) give me some time to understand.
For the discovery proces this might be true, for regular things that initalize, the From #17956 (comment) it looks like the initalization of each device has a sort of fixed time. With parallellism i would expect to have a bursts if initialisations. |
No, I don't have enough overview (yet) to really have an opinion about causes the delays, so I've just written down things as I've seen them while studying the code.
Yes, I was thinking of the discovery scan. It's hardcoded to |
Take your time, but FYI when you get there: A If the running program/process tries to write to one of these pipes and there is no buffering, or the buffer is full, their "write command" will block until there is space in the pipe. This is a normal OS "blocking I/O" operation, and will apply to any process that tries to write when there's no space to do so. Thus, the Java program that "handles" this process must make sure to always consume whatever is written to these pipes while the process is running, or it could "block forever"/deadlock. That is not what is done here, instead After a quick look, it seems like the "arpping" command handles This isn't necessarily difficult to fix, but testing that it works properly under all circumstances can be daunting. That said, we can already "know" that the current code won't work properly under all circumstances, so it might still be worth making something more likely to work even if we're not able to test it under enough variety of circumstances. I could consider making a PR against your PR for this part, so that you don't need to think about the process handling on top of the treading stuff. |
@lsiepel I've checked out your branch and am trying to do some debugging. But, I'm having a lot of trouble because this branch is on 5.0 while my core repo is checked out at latest/5.1. I could easily rebase the branch to 5.1, but that would make it "impossible" to create PRs from my branch into yours. I could check out 5.0 core, but I'd really prefer not to, because Eclipse is going to spend hours rebuilding everything. Is there a reason why this is still at 5.0, or could you just rebase it on latest |
I can rebase, just have some commits in my main I need to preserve. First diner and kids to bed 😂 |
done (i first did a merge commit, but removed and now did a rebase) |
2c4962d
to
0e7e4cd
Compare
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
@lsiepel : please update (regenerate) the i18n properties file network.properties. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review part 1 of 2
...inding.network/src/main/java/org/openhab/binding/network/internal/NetworkHandlerFactory.java
Show resolved
Hide resolved
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Outdated
Show resolved
Hide resolved
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Show resolved
Hide resolved
...nding.network/src/main/java/org/openhab/binding/network/internal/handler/NetworkHandler.java
Show resolved
Hide resolved
bundles/org.openhab.binding.network/src/main/resources/OH-INF/addon/addon.xml
Show resolved
Hide resolved
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
bcad22e
to
e36ff67
Compare
@lsiepel Since you added mocking of the admin service, you can remove the nullness annotation on the admin service reference in the constructor, and the corresponding edit: Oh, I see, you already did this in reverse order 😉 There's a reason why I left that |
Signed-off-by: Leo Siepel <leosiepel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last part of the review
} catch (InterruptedException e) { | ||
Thread.currentThread().interrupt(); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why catching the interruption exception if this is just to retrigger it ?
Same comment for other places in the same file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not retrigger the exception, as in it does not re-throw the exceptions. The exception is catched, thread status is restored and the method is immediate returned to abort fast.
This is a best practice for gracefull termination. What else would you recommend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, my sentence was not enough precise, the exception is not retriggered by your code but your code is requesting the thread to interrupt again and so this will lead to a new interruption exception a moment later. in the code
That looks a little strange to me but if this is the best practice, that is fine for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, InterruptedException
isn't well understood by many. It will only ever be thrown if the thread is interrupted, and that's usually only if an executor is being shut down or if the program terminates. The JVM will interrupt all (non-daemon I think) threads as a part of the shutdown process, it's how a graceful shutdown is supposed to work.
So, generally, when interrupted, the goal is that the code should wrap up whatever it is doing and exit as soon as possible. A thread won't stop running because it is interrupted, "normal code" will keep running (it is possible for code to check isInterrupted()
manually before starting long/heavy operations, but I very rarely see this done). But, most operations that are blocking will use InterruptedException
as really the only way they can be "aborted".
When you catch a InterruptedException
, the interrupted flag of the thread is cleared. I guess those that designed it though that it was a good idea because it would give the flexibility to use interrupts "for program flow", but I bet they didn't anticipate how few would actually familiarize themselves with how this works. Because, if the goal really is to terminate what it's doing, like here, you need the flag to "stay on" until the task has exited completely. Otherwise, it will just interrupt what it's doing right there, and then return to the calling code, which might contain other blocking operations that would then block and prevent the task from exiting.
This is why it's "recommended" to set the interrupted flag when you catch InterruptedException
, when you do that, you make sure that the code will again be interrupted if it tries to sleep/wait for something.
@Nadahar : is it ok for you ? |
Yes, as far as I know, this is good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This pull request has been mentioned on openHAB Community. There might be relevant details there: |
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/openhab-5-1-milestone-discussion/166385/36 |
…nhab#17972) * Fix performance * Improvements * Add logging * Improve logging * Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java * Improve thread handling * Improve shutdown * thread cleanup * Improve per thread allocation * Stop on finishing all interfaces * Change Arping to make use of completeablefeature * Seperate presence detection from lifecycle * Improve logging and filtering * Create seperate ScheduledExecutorService * Fix review comment * Improve network address checks * Improve interrupthandling * Revert threadlocal * Refactor Presencedetection * Make LatencyParser accept Windows' <1ms syntax for minimal latency * Remove misleading reference to non-existing NetworkHandlerBuilder * Handle thread-safety of NetworkDiscoveryService.executorService * Fix network interface exclusion * Make PresenceDetectionValue thread-safe * Partial refactoring of PresenceDetection Fixes: - Synchronization of lastSeen - Joining of async tasks - Minor logging improvements Addition: - Create setIcmpPingMethod() * Partial refactoring of NetworkDiscoveryService Fixes: - Correct the number of addresses in a /24 subnet. - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly. - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down. * Make NetworkHandler thread-safe * Make SpeedTestHandler thread-safe * Make sure that process output is consumed * Implement tool-specific latency parsing * Fix NetworkDiscoveryService configuration and make the thread pool size configurable * i18n * Fix test Signed-off-by: Leo Siepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
…nhab#17972) * Fix performance * Improvements * Add logging * Improve logging * Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java * Improve thread handling * Improve shutdown * thread cleanup * Improve per thread allocation * Stop on finishing all interfaces * Change Arping to make use of completeablefeature * Seperate presence detection from lifecycle * Improve logging and filtering * Create seperate ScheduledExecutorService * Fix review comment * Improve network address checks * Improve interrupthandling * Revert threadlocal * Refactor Presencedetection * Make LatencyParser accept Windows' <1ms syntax for minimal latency * Remove misleading reference to non-existing NetworkHandlerBuilder * Handle thread-safety of NetworkDiscoveryService.executorService * Fix network interface exclusion * Make PresenceDetectionValue thread-safe * Partial refactoring of PresenceDetection Fixes: - Synchronization of lastSeen - Joining of async tasks - Minor logging improvements Addition: - Create setIcmpPingMethod() * Partial refactoring of NetworkDiscoveryService Fixes: - Correct the number of addresses in a /24 subnet. - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly. - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down. * Make NetworkHandler thread-safe * Make SpeedTestHandler thread-safe * Make sure that process output is consumed * Implement tool-specific latency parsing * Fix NetworkDiscoveryService configuration and make the thread pool size configurable * i18n * Fix test Signed-off-by: Leo Siepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
…nhab#17972) * Fix performance * Improvements * Add logging * Improve logging * Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java * Improve thread handling * Improve shutdown * thread cleanup * Improve per thread allocation * Stop on finishing all interfaces * Change Arping to make use of completeablefeature * Seperate presence detection from lifecycle * Improve logging and filtering * Create seperate ScheduledExecutorService * Fix review comment * Improve network address checks * Improve interrupthandling * Revert threadlocal * Refactor Presencedetection * Make LatencyParser accept Windows' <1ms syntax for minimal latency * Remove misleading reference to non-existing NetworkHandlerBuilder * Handle thread-safety of NetworkDiscoveryService.executorService * Fix network interface exclusion * Make PresenceDetectionValue thread-safe * Partial refactoring of PresenceDetection Fixes: - Synchronization of lastSeen - Joining of async tasks - Minor logging improvements Addition: - Create setIcmpPingMethod() * Partial refactoring of NetworkDiscoveryService Fixes: - Correct the number of addresses in a /24 subnet. - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly. - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down. * Make NetworkHandler thread-safe * Make SpeedTestHandler thread-safe * Make sure that process output is consumed * Implement tool-specific latency parsing * Fix NetworkDiscoveryService configuration and make the thread pool size configurable * i18n * Fix test Signed-off-by: Leo Siepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
* [network] Fix discovery performance causing a slow openHAB start (#17972) * Fix performance * Improvements * Add logging * Improve logging * Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java * Improve thread handling * Improve shutdown * thread cleanup * Improve per thread allocation * Stop on finishing all interfaces * Change Arping to make use of completeablefeature * Seperate presence detection from lifecycle * Improve logging and filtering * Create seperate ScheduledExecutorService * Fix review comment * Improve network address checks * Improve interrupthandling * Revert threadlocal * Refactor Presencedetection * Make LatencyParser accept Windows' <1ms syntax for minimal latency * Remove misleading reference to non-existing NetworkHandlerBuilder * Handle thread-safety of NetworkDiscoveryService.executorService * Fix network interface exclusion * Make PresenceDetectionValue thread-safe * Partial refactoring of PresenceDetection Fixes: - Synchronization of lastSeen - Joining of async tasks - Minor logging improvements Addition: - Create setIcmpPingMethod() * Partial refactoring of NetworkDiscoveryService Fixes: - Correct the number of addresses in a /24 subnet. - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly. - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down. * Make NetworkHandler thread-safe * Make SpeedTestHandler thread-safe * Make sure that process output is consumed * Implement tool-specific latency parsing * Fix NetworkDiscoveryService configuration and make the thread pool size configurable * i18n * Fix test Signed-off-by: Leo Siepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com> * [network] Use a threaded consumer for Process output consumption (#19398) * Use a threaded consumer for Process output consumption Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com> --------- Signed-off-by: Leo Siepel <leosiepel@gmail.com> Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com> Co-authored-by: lsiepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
performArpPing
#16262No breaking changes.
Improved logging
Improved separation of concerns
Improved parsing of ping results (<1ms)
Improved resource use (memory and threads)
Discovery takes binding configuration into consideration
should be backported to 5.0.x and 4.3.x.
Testable jar 5.1.0 snapshot (2025-09-05) or later: https://1drv.ms/u/c/70ea7ac46bc61c73/EegJadzMpp1IlQYIN5kMCfwBpyUEcQF8ReQZGRMTmTcllw?e=TQEpxt