-
-
Notifications
You must be signed in to change notification settings - Fork 457
Isolate discovery result registrations from binding threads #5032
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Isolate discovery result registrations from binding threads #5032
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves concurrency and thread safety in the OpenHab discovery system by isolating discovery result registrations from binding threads and addressing thread contention issues in AbstractDiscoveryService
and DiscoveryServiceRegistryImpl
. The primary goal is to prevent thread starvation and deadlocks that occur when binding threads are blocked waiting for slow discovery registrations.
- Discovery result registrations are now executed asynchronously using the discovery thread pool instead of blocking binding threads
- Thread safety improvements include better synchronization patterns and reduced lock contention
- Collection type changes to thread-safe alternatives to prevent concurrent modification issues
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
DiscoveryServiceRegistryImpl.java | Removes unnecessary synchronized methods, changes to thread-safe collections, and improves synchronization patterns to reduce contention |
AbstractDiscoveryService.java | Adds asynchronous execution of discovery listener notifications and improves thread safety with better synchronization of mutable state |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
...nfig.discovery/src/main/java/org/openhab/core/config/discovery/AbstractDiscoveryService.java
Show resolved
Hide resolved
...y/src/main/java/org/openhab/core/config/discovery/internal/DiscoveryServiceRegistryImpl.java
Show resolved
Hide resolved
...nfig.discovery/src/main/java/org/openhab/core/config/discovery/AbstractDiscoveryService.java
Show resolved
Hide resolved
By using the discovery thread pool to do the actual registrations in AbstractDiscoveryService, binding threads won't have to wait for the registration to complete, which can be slow. In addition, some thread-safety fixes have been done in AbstractDiscoveryService and DiscoveryServiceRegistryImpl, both to avoid contention and to ensure isolation of mutable objects. Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
459f33c
to
c78ccda
Compare
I pushed a rebase to latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for diving into this and for your detailed explanations!
I must admit, I cannot assess the impact of all the small changes that you did here, but it's likely a good moment to have this merged and find out if there are any corner cases where we might have regressions.
@Nadahar - some tests in addons are failing today - see https://ci.openhab.org/job/openHAB-Addons/1928/. It seems they are all related to discovery, so perhaps you could help figuring out why? One example: https://github.com/openhab/openhab-addons/blob/main/bundles/org.openhab.binding.boschshc/src/test/java/org/openhab/binding/boschshc/internal/discovery/ThingDiscoveryServiceTest.java:
|
@jlaur Let's see how much is resolved now that I've rebased openhab/openhab-addons#19230 before I start looking into the details. |
@jlaur I'm looking at org.openhab.binding.boschshc now, and that test failure is definitely related to this PR. The test wants to verify the discovery result, but it's no longer registered because the registration is now done async. This mocking really isn't my strong suite, but I'll try to figure out something. Anybody else that have some hints as to how to address that are very welcome to chime in. I guess it could either be that the threads doing the registration don't exist because of the mock, or it could be that they do run, but that the results aren't in immediately after method return, as the test expects. |
I've fixed |
openhab/openhab-core#5032 made discovery result registration asynchronous, which led to some tests failing because the discovery results aren't immediately available one the call to thingDiscovered() returns - because the registration itself is done in a different thread. This has been addressed by pausing slightly before testing the results. Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
openhab/openhab-core#5032 made discovery result registration asynchronous, which led to some tests failing because the discovery results aren't immediately available one the call to thingDiscovered() returns - because the registration itself is done in a different thread. This has been addressed by pausing slightly before testing the results. Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
openhab/openhab-core#5032 made discovery result registration asynchronous, which led to some tests failing because the discovery results aren't immediately available one the call to thingDiscovered() returns - because the registration itself is done in a different thread. This has been addressed by pausing slightly before testing the results. Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
openhab/openhab-core#5032 made discovery result registration asynchronous, which led to some tests failing because the discovery results aren't immediately available one the call to thingDiscovered() returns - because the registration itself is done in a different thread. This has been addressed by pausing slightly before testing the results. Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/openhab-5-1-milestone-discussion/166385/55 |
By using the discovery thread pool to do the actual registrations in
AbstractDiscoveryService
, binding threads won't have to wait for the registration to complete, which can be slow.In addition, some thread-safety fixes have been done in
AbstractDiscoveryService
andDiscoveryServiceRegistryImpl
, both to avoid contention and to ensure isolation of mutable objects.This has emerged as an idea of mine after "investigating" several cases where OH gets very slow, stops working all together or crashes. Blocked threads with ever-increasing memory consumption as the thread pool queues grows because they are never processed, seems to be a common theme.
There are many reasons for the situation, this PR only addresses a little piece of the puzzle. But, I think it can make a real difference, because by using an executor to do the actual registration, we "break" the chain of locks/monitors being held. It no longer matters what locks the binding holds when
thingDiscovered()
is called, when the registration is done by the thread pool, it won't "inherit" any of the locks, and the rest of the system is free to move forward.As everybody know, I assume, one of the "classic traps" in concurrency is to lock two locks/monitors in different order in different places of the code. Doing that practically ensures that you will have a deadlock, but it can be hard to avoid when parts of the system invokes other parts of the system in a complex way. The best way I've found to avoid this is to hold the locks as short as possible, and don't make calls into "unknown code" while holding locks, if at all avoidable. Invoking a listener method while holding a lock is a typical situation that has a high risk of causing a deadlock.
Even without a deadlock, the extreme slowness of the registration of discovery results can have cascading effects throughout the system. It would be interesting to find the reason for the slowness, and I hope that some of the "contention fixes" in this PR will help mitigate the problem. Regardless, an added bonus of isolating the locks by using the thread pool to do the registrations, is that this cascading effect is prevented.
Recently, when we worked on openhab/openhab-addons#17972, we had to make the Network binding spawn separate threads for registering the results to make the discovery operation smooth/fast. This PR will make that unnecessary, and it will benefit all bindings.
When I created openhab/openhab-addons#19351 a couple of days ago, it dawned on me that trying to chase this down on a binding-by-binding basis was the wrong approach.
Here is an excerpt from the thread dump from the associated forum thread (pay attention to lock
0x00000000814f9128
):For all mDNS discoveries, the
JmDNS
threads will be he ones performing the actual "discovery process", because they are the ones the executes the listener method. In the above case,JmDNS pool-26-thread-1
is busy deep inside Gson doing something that I assume takes a long time, or is held up by something else. In the meanwhile, 4 other JmDNS threads and the Teleinfo binding discovery thread are blocked waiting forJmDNS pool-26-thread-1
to finish. I don't know how big the JmDNS thread pool is, but this could potentially exchaust the thread pool. For Teleinfo, the thread nameOH-binding-teleinfo:serialcontroller:teleinfoserial
hints that this is the sole thread responsible for communication with the serial controller that is being blocked.By isolating the discovery result registration from the rest of the system, we will no longer have situations like this.