[SPARK-53900][CONNECT] Fix unintentional `Thread.wait(0)` under rare conditions in `ExecuteGrpcResponseSender` #52609

vicennial · 2025-10-14T12:52:05Z

What changes were proposed in this pull request?

Sets a lower bound of 1 to values passed into Thread#wait to avoid an unintentional indefinite wait.

Why are the changes needed?

A bug in ExecuteGrpcResponseSender causes RPC streams to hang indefinitely when the configured deadline passes. The bug was introduced in PR during migration from System.currentTimeMillis() to System.nanoTime(), where an integer division error converts sub-millisecond timeout values to 0, triggering Java's wait(0) behavior (infinite wait).

Root Cause

executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS) // ← BUG
The Problem: When deadlineTimeNs < System.nanoTime() (deadline has passed):

Math.max(1, negative_value) clamps to 1 nanosecond
Math.min(progressInterval_ns, 1) remains 1 nanosecond
Integer division: 1 / 1,000,000 = 0 milliseconds
wait(0) in Java means wait indefinitely until notified
No notification arrives (execution already completed), thread hangs forever

While one the loop conditions guards against deadlineTimeNs < System.nanoTime(), it isn’t sufficient as the deadline can elapse while inside the loop (the time is freshly fetched in the latter timeout calculation). The probability of occurrence can be exacerbated by GC pauses.

Conditions Required for Bug to Trigger

The bug manifests when all of the following conditions are met:

Reattachable execution enabled (CONNECT_EXECUTE_REATTACHABLE_ENABLED = true)
Execution completes prior to the deadline within the inner loop
(all responses sent before deadline)
Deadline passes within the inner loop

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No.

vicennial · 2025-10-14T12:52:49Z

@hvanhovell / @HyukjinKwon PTAL!

vicennial · 2025-10-14T12:53:15Z

We should backport this into older relases as well

hvanhovell

LGTM

hvanhovell · 2025-10-14T12:56:31Z

@vicennial which versions? Spark 4.0? Or also 3.5?

vicennial · 2025-10-14T13:27:33Z

@hvanhovell upto 4.0 is sufficient, 3.5 does not have the issue

dongjoon-hyun

+1, LGTM.

hvanhovell · 2025-10-14T16:03:17Z

Merging to master/4.0. Thanks!

…conditions in `ExecuteGrpcResponseSender` ### What changes were proposed in this pull request? Sets a lower bound of `1` to values passed into `Thread#wait` to avoid an unintentional indefinite wait. ### Why are the changes needed? A bug in `ExecuteGrpcResponseSender` causes RPC streams to hang indefinitely when the configured deadline passes. The bug was introduced in [PR](https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253]) during migration from System.currentTimeMillis() to System.nanoTime(), where an integer division error converts sub-millisecond timeout values to 0, triggering Java's wait(0) behavior (infinite wait). #### Root Cause `executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS) // ← BUG` The Problem: When `deadlineTimeNs < System.nanoTime()` (deadline has passed): - Math.max(1, negative_value) clamps to 1 nanosecond - Math.min(progressInterval_ns, 1) remains 1 nanosecond - Integer division: 1 / 1,000,000 = 0 milliseconds - wait(0) in Java means wait indefinitely until notified - No notification arrives (execution already completed), thread hangs forever While one the loop conditions guards against `deadlineTimeNs < System.nanoTime()`, it isn’t sufficient as the deadline can elapse while inside the loop (the time is freshly fetched in the latter timeout calculation). The probability of occurrence can be exacerbated by GC pauses. #### Conditions Required for Bug to Trigger The bug manifests when all of the following conditions are met: - Reattachable execution enabled (CONNECT_EXECUTE_REATTACHABLE_ENABLED = true) - Execution completes prior to the deadline within the inner loop - (all responses sent before deadline) - Deadline passes within the inner loop ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #52609 from vicennial/fixWait. Authored-by: vicennial <venkata.gudesa@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit ff0f1ab) Signed-off-by: Herman van Hovell <herman@databricks.com>

init

60b1843

github-actions bot added SQL CONNECT labels Oct 14, 2025

hvanhovell approved these changes Oct 14, 2025

View reviewed changes

dongjoon-hyun approved these changes Oct 14, 2025

View reviewed changes

asf-gitbox-commits closed this in ff0f1ab Oct 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-53900][CONNECT] Fix unintentional `Thread.wait(0)` under rare conditions in `ExecuteGrpcResponseSender` #52609

[SPARK-53900][CONNECT] Fix unintentional `Thread.wait(0)` under rare conditions in `ExecuteGrpcResponseSender` #52609

vicennial commented Oct 14, 2025 •

edited

Loading

Uh oh!

vicennial commented Oct 14, 2025

Uh oh!

vicennial commented Oct 14, 2025

Uh oh!

hvanhovell left a comment

Uh oh!

hvanhovell commented Oct 14, 2025

Uh oh!

vicennial commented Oct 14, 2025

Uh oh!

dongjoon-hyun left a comment

Uh oh!

hvanhovell commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SPARK-53900][CONNECT] Fix unintentional Thread.wait(0) under rare conditions in ExecuteGrpcResponseSender #52609

[SPARK-53900][CONNECT] Fix unintentional Thread.wait(0) under rare conditions in ExecuteGrpcResponseSender #52609

Conversation

vicennial commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Root Cause

Conditions Required for Bug to Trigger

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

vicennial commented Oct 14, 2025

Uh oh!

vicennial commented Oct 14, 2025

Uh oh!

hvanhovell left a comment

Choose a reason for hiding this comment

Uh oh!

hvanhovell commented Oct 14, 2025

Uh oh!

vicennial commented Oct 14, 2025

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

hvanhovell commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SPARK-53900][CONNECT] Fix unintentional `Thread.wait(0)` under rare conditions in `ExecuteGrpcResponseSender` #52609

[SPARK-53900][CONNECT] Fix unintentional `Thread.wait(0)` under rare conditions in `ExecuteGrpcResponseSender` #52609

vicennial commented Oct 14, 2025 •

edited

Loading