-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[network] Fix case where state no longer updates #19398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If this turns out to work well, I think it's tempting to try to get If we have one implementation that is "clean" and works, it's better to reuse that than having everybody "reinvent the wheel". |
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
d7b177c
to
471cd0b
Compare
Here is the JAR in case anybody wants to test it without having to build. |
I've been running with 12 network Things for a little while now, all the "network threads" are still idling/parked. They work very shortly when polling a device, then goes back to idling. But, it might not mean anything, I don't think this bug exists on Windows in any case. edit: Running for 3 hours, still no problems. |
I've tried this commit - I've built it from scratch though - your jar was complaining about missing speedtest? - and it is still happening on my main Openhab instance on Rapsberry Pi. Its not happening on my Windows Openhab build, but over there I just have couple of things and items for testing only. I'm also using Windows ping. I have feeling that it depends on speed of execution, if whole process is fast, then there is no problem but if whole cycle takes too long, then we situation like: So maybe reason why its not happening on Windows is simple - it is fast enough to complete it? |
Yes, I had a problem with feature resolution when building it, so there is probably some dependency that has changed version between some builds. To rectify this, I don't know any other way than to build all add-ons - which takes a couple of hours or so, that I can't really use the computer for anything else (because it is extremely laggy while building).
You are positive that you got the "right version" of running? I've done all I can think of with the
That would absolutely make sense - the problem is that I can't find anything that could cause this in the code. The pool in question is only used for scheduling tasks, not for completing anything. So, to consume all threads in the pool, the scheduled tasks don't return in time for the next run - which should be close to impossible with a "normal" refresh interval. This is also why I've been focusing on the
If it's caused by a problem with the The way we did it was to drain the pipe until it was closed while the process was running, and only when the pipe was closed, continue finishing the task. But, I know that there are situations (although I thought they were quite rare) where the pipe doesn't close as it should, which would leave the whole thing essentially waiting forever for the pipe to close. To address this, I've now moved the reading of the pipe to a separate thread, so the "task thread" still shouldn't be caught up even if the pipe isn't closed. However, the "task thread" still needs to retrieve the output to analyze it, so it could still potentially hang when trying to Could you make a new thread dump with it "hanging" with this PR running? There must be something I'm missing that is happening here, I just need to figure out what it is. |
You can build from the specific addon folder itself and it is done in less then a minute. It only builds that jar. I remember you have an advanced build setup, so you are probably aware and something else prevents this. |
Yes, I normally do this, obviously, but sometimes something goes wrong with the feature resolution, and something is f. up with the dependencies. When that happens, I get a build error during feature verification, after the JAR itself has been created, and that's what happened yesterday. The only way I know to resolve that is to build them all, because then all the feature resolution stuff is redone. I've done that now, so here is a new version (I see that this version is larger, so something was definitely missing from the other one): |
Here is dump using your .jar
I also started adding some extra debugs on my end to observe what is going, not sure if it helps somehow
|
Argh, I'm sorry to say, but you were absolutely right in your previous analysis. The same pool is used to resolve the This is clearly the problem then, I'll have to see how it used to be done to see what a potential solution could be (assuming that we don't want to create yet another thread pool). |
Ok, I see now that the binding used to create a single threaded executor service (basically a complicated way to create a thread) for each "presence detection task" that was used to resolve the I see no reason to create another single threaded executor, but I think we need a dedicated pool to resolve the |
I just completed setting this up on a Linux (Fedora) VM to see if I could reproduce the problem. I've configured 9 things that does pings, have 5 threads in the pool, and still it doesn't happen. It might have to do with whether these happen to be schedules to ping at the same time or not, and that's probably somewhat random. The problem is now clear, so I don't mean to say that it shouldn't be fixed, I was just hoping that I could reproduce it so I could verify that I fixed it - but that doesn't seem to be so easy. edit: I just managed to get it to happen on the Linux VM 👍 It seems like some "luck" is needed as well. |
I managed to repro it also on Windows
|
I'm working on a fix, hopefully shouldn't be too long. |
This avoids thread starvation leading to Futures never resolving Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
e1bf0e5
to
570bbca
Compare
The build seems to fail because there are problems with openhab.org again. Building locally also fails for the same reason. Failed to load schema with public ID null, system ID https://openhab.org/schemas/thing-description-1.0.0.xsd: schema_reference.4: Failed to read schema document 'https://openhab.org/schemas/thing-description-1.0.0.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.: Unknown host openhab.org It seems like there are some DNS trouble, I don't think there is much I can do to make the build work in the meanwhile: https://community.openhab.org/t/openhab-org-temporary-blocked-because-of-certificate-issue/166513 |
Meanwhile, Eclipse can build it just fine without the schema, so my testing works both locally and on the VM without issue now. |
I added a random factor (up to 5 seconds) to the initial delay for the poll tasks, so that they aren't all scheduled to run at the exact same time. It can mean that the Things can take a few seconds to come online after startup, but I think it's better to have them spread out a little bit. Any viewpoints? |
Now the DNS and certificate trouble has been resolved, and building is possible, here is the latest version: |
Seems to work now! 🎉 |
@lsiepel I think this is good now. More testing would always be good, but the problem has been identified and addressed. I still think the changes to the But, having already made the changes to avoid this rare problem, I see no reason to not include it. This should probably be backported as well, since #17972 was. |
...rk/src/main/java/org/openhab/binding/network/internal/discovery/NetworkDiscoveryService.java
Outdated
Show resolved
Hide resolved
…binding/network/internal/discovery/NetworkDiscoveryService.java Signed-off-by: lsiepel <leosiepel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into these issues deeply. LGTM
I tried to accept your suggestion, but it's already outdated. Did you already apply it? edit: Sorry, just saw 58fdbf6 now. |
Ultra minor, so applied it myself. waiting for build now. |
@lsiepel I think a backport should be done, but it would maybe be nice with a little bit more testing first, now that it's merged? |
Backport to 5.0.x fails due to conflicts. |
I can take care of that, the question is if we want to give it a few days to see if anything new pops up..? |
Yes, let's postpone a backport close to the next patch release. No date is set yet. Hopefully 5.1.0 m2 is first. |
…nhab#19398) * Use a threaded consumer for Process output consumption Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
…nhab#19398) * Use a threaded consumer for Process output consumption Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
…nhab#19398) * Use a threaded consumer for Process output consumption Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
* [network] Fix discovery performance causing a slow openHAB start (#17972) * Fix performance * Improvements * Add logging * Improve logging * Update bundles/org.openhab.binding.network/src/main/java/org/openhab/binding/network/internal/utils/NetworkUtils.java * Improve thread handling * Improve shutdown * thread cleanup * Improve per thread allocation * Stop on finishing all interfaces * Change Arping to make use of completeablefeature * Seperate presence detection from lifecycle * Improve logging and filtering * Create seperate ScheduledExecutorService * Fix review comment * Improve network address checks * Improve interrupthandling * Revert threadlocal * Refactor Presencedetection * Make LatencyParser accept Windows' <1ms syntax for minimal latency * Remove misleading reference to non-existing NetworkHandlerBuilder * Handle thread-safety of NetworkDiscoveryService.executorService * Fix network interface exclusion * Make PresenceDetectionValue thread-safe * Partial refactoring of PresenceDetection Fixes: - Synchronization of lastSeen - Joining of async tasks - Minor logging improvements Addition: - Create setIcmpPingMethod() * Partial refactoring of NetworkDiscoveryService Fixes: - Correct the number of addresses in a /24 subnet. - Restructure task creation so that one less thread is needed per scanned address, and so that startScan() is guaranteed to return quickly. - Create independent threads for posting discovery results, as this can take a long time and might not finish before the executor shuts down. * Make NetworkHandler thread-safe * Make SpeedTestHandler thread-safe * Make sure that process output is consumed * Implement tool-specific latency parsing * Fix NetworkDiscoveryService configuration and make the thread pool size configurable * i18n * Fix test Signed-off-by: Leo Siepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com> * [network] Use a threaded consumer for Process output consumption (#19398) * Use a threaded consumer for Process output consumption Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com> --------- Signed-off-by: Leo Siepel <leosiepel@gmail.com> Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com> Co-authored-by: lsiepel <leosiepel@gmail.com> Co-authored-by: Wouter Born <github@maindrain.net> Co-authored-by: Ravi Nadahar <nadahar@rediffmail.com>
This is an attempt to fix #19397.
I've outsourced the consumption of the process output to a separate thread to prevent the "executing thread" for being blocked if the output isn't closed as it should.
In addition, I've added a couple of "emergency timeouts" that should hopefully prevent the thread from blocking for too long if things go wrong for some reason.
@lsiepel We'll need to do some testing to see how this works.