[SPARK-53900][CONNECT] Fix unintentional Thread.wait(0) under rare conditions in ExecuteGrpcResponseSender #52609
Conversation
@hvanhovell / @HyukjinKwon PTAL!
We should backport this into older releases as well.
LGTM
@vicennial which versions? Spark 4.0? Or also 3.5?
@hvanhovell up to 4.0 is sufficient; 3.5 does not have the issue.
+1, LGTM.
Merging to master/4.0. Thanks!
What changes were proposed in this pull request?
Sets a lower bound of `1` on the values passed into `Thread#wait` to avoid an unintentional indefinite wait.
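For illustration only, here is a minimal Scala sketch of the clamping idea, assuming names taken from the description (`deadlineTimeNs`, `progressTimeoutNs`, `responseLock`, `NANOS_PER_MILLIS`); the actual code in `ExecuteGrpcResponseSender` may differ:

```scala
// Sketch, not the actual Spark code: compute the milliseconds passed to Object#wait
// so that a sub-millisecond (or already elapsed) timeout can never truncate to
// wait(0), which Java interprets as "wait until notified".
val NANOS_PER_MILLIS: Long = 1000000L

def boundedWaitMillis(deadlineTimeNs: Long, progressTimeoutNs: Long): Long = {
  val remainingNs = math.max(1L, deadlineTimeNs - System.nanoTime())
  val timeoutNs = math.min(progressTimeoutNs, remainingNs)
  // Lower bound of 1 ms so that integer division cannot yield 0.
  math.max(1L, timeoutNs / NANOS_PER_MILLIS)
}

// Hypothetical call site inside the send loop:
// responseLock.synchronized {
//   responseLock.wait(boundedWaitMillis(deadlineTimeNs, progressTimeoutNs))
// }
```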
Why are the changes needed?
A bug in `ExecuteGrpcResponseSender` causes RPC streams to hang indefinitely once the configured deadline passes. The bug was introduced in [PR #49003](https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253) during the migration from `System.currentTimeMillis()` to `System.nanoTime()`, where an integer division error converts sub-millisecond timeout values to 0, triggering Java's `wait(0)` behavior (an infinite wait).

Root Cause
`executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS) // ← BUG`
The problem: when `deadlineTimeNs < System.nanoTime()` (the deadline has already passed):
- `Math.max(1, negative_value)` clamps to 1 nanosecond
- `Math.min(progressInterval_ns, 1)` remains 1 nanosecond
- Integer division: 1 / 1,000,000 = 0 milliseconds
- `wait(0)` in Java means wait indefinitely until notified
- No notification arrives (the execution has already completed), so the thread hangs forever

While one of the loop conditions guards against `deadlineTimeNs < System.nanoTime()`, it isn't sufficient: the deadline can elapse while inside the loop (the current time is freshly fetched in the later timeout calculation). The probability of occurrence can be exacerbated by GC pauses. The arithmetic is made concrete in the sketch after the trigger conditions below.

Conditions Required for the Bug to Trigger
The bug manifests when all of the following conditions are met:
- Reattachable execution is enabled (`CONNECT_EXECUTE_REATTACHABLE_ENABLED = true`)
- Execution completes prior to the deadline within the inner loop (all responses sent before the deadline)
- The deadline passes within the inner loop
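To make the failure arithmetic concrete, a small self-contained Scala example (hypothetical values, not the Spark code itself) showing how an already-elapsed deadline collapses to a 0 ms wait:

```scala
object WaitZeroDemo {
  def main(args: Array[String]): Unit = {
    val NANOS_PER_MILLIS = 1000000L

    // Pretend the deadline elapsed 5 ms ago and the progress interval is 2 s.
    val now = System.nanoTime()
    val deadlineTimeNs = now - 5 * NANOS_PER_MILLIS
    val progressTimeoutNs = 2000 * NANOS_PER_MILLIS

    // Mirrors the buggy computation described above.
    val timeoutNs = math.min(progressTimeoutNs, math.max(1L, deadlineTimeNs - now))
    val timeoutMs = timeoutNs / NANOS_PER_MILLIS

    println(s"timeoutNs = $timeoutNs") // 1, clamped to a single nanosecond
    println(s"timeoutMs = $timeoutMs") // 0 after integer division
    // Object#wait(0) would now block until notified, i.e. potentially forever
    // when the execution has already completed and no notification is coming.
  }
}
```

With the fix's lower bound in place, the same inputs yield a 1 ms wait instead of `wait(0)`.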
Does this PR introduce any user-facing change?
No.
How was this patch tested?
N/A
Was this patch authored or co-authored using generative AI tooling?
No.

Closes #52609 from vicennial/fixWait.

Authored-by: vicennial <venkata.gudesa@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit ff0f1ab)
Signed-off-by: Herman van Hovell <herman@databricks.com>