Skip to content

Conversation

vicennial
Copy link
Contributor

@vicennial vicennial commented Oct 14, 2025

What changes were proposed in this pull request?

Sets a lower bound of 1 to values passed into Thread#wait to avoid an unintentional indefinite wait.

Why are the changes needed?

A bug in ExecuteGrpcResponseSender causes RPC streams to hang indefinitely when the configured deadline passes. The bug was introduced in PR during migration from System.currentTimeMillis() to System.nanoTime(), where an integer division error converts sub-millisecond timeout values to 0, triggering Java's wait(0) behavior (infinite wait).

Root Cause

executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS)  // ← BUG
The Problem: When deadlineTimeNs < System.nanoTime() (deadline has passed):

  • Math.max(1, negative_value) clamps to 1 nanosecond
  • Math.min(progressInterval_ns, 1) remains 1 nanosecond
  • Integer division: 1 / 1,000,000 = 0 milliseconds
  • wait(0) in Java means wait indefinitely until notified
  • No notification arrives (execution already completed), thread hangs forever

While one the loop conditions guards against deadlineTimeNs < System.nanoTime(), it isn’t sufficient as the deadline can elapse while inside the loop (the time is freshly fetched in the latter timeout calculation). The probability of occurrence can be exacerbated by GC pauses.

Conditions Required for Bug to Trigger

The bug manifests when all of the following conditions are met:

  • Reattachable execution enabled (CONNECT_EXECUTE_REATTACHABLE_ENABLED = true)
  • Execution completes prior to the deadline within the inner loop
  • (all responses sent before deadline)
  • Deadline passes within the inner loop

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No.

@vicennial
Copy link
Contributor Author

@hvanhovell / @HyukjinKwon PTAL!

@vicennial
Copy link
Contributor Author

We should backport this into older relases as well

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hvanhovell
Copy link
Contributor

@vicennial which versions? Spark 4.0? Or also 3.5?

@vicennial
Copy link
Contributor Author

@hvanhovell upto 4.0 is sufficient, 3.5 does not have the issue

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@hvanhovell
Copy link
Contributor

Merging to master/4.0. Thanks!

asf-gitbox-commits pushed a commit that referenced this pull request Oct 14, 2025
…conditions in `ExecuteGrpcResponseSender`

### What changes were proposed in this pull request?

Sets a lower bound of `1` to values passed into `Thread#wait` to avoid an unintentional indefinite wait.

### Why are the changes needed?

A bug in `ExecuteGrpcResponseSender` causes RPC streams to hang indefinitely when the configured deadline passes. The bug was introduced in [PR](https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253]) during migration from System.currentTimeMillis() to System.nanoTime(), where an integer division error converts sub-millisecond timeout values to 0, triggering Java's wait(0) behavior (infinite wait).

#### Root Cause

`executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS)  // ← BUG`
The Problem: When `deadlineTimeNs < System.nanoTime()` (deadline has passed):

- Math.max(1, negative_value) clamps to 1 nanosecond
- Math.min(progressInterval_ns, 1) remains 1 nanosecond
- Integer division: 1 / 1,000,000 = 0 milliseconds
- wait(0) in Java means wait indefinitely until notified
- No notification arrives (execution already completed), thread hangs forever

While one the loop conditions guards against `deadlineTimeNs < System.nanoTime()`, it isn’t sufficient as the deadline can elapse while inside the loop (the time is freshly fetched in the latter timeout calculation). The probability of occurrence can be exacerbated by GC pauses.

#### Conditions Required for Bug to Trigger

The bug manifests when all of the following conditions are met:

- Reattachable execution enabled (CONNECT_EXECUTE_REATTACHABLE_ENABLED = true)
- Execution completes prior to the deadline within the inner loop
- (all responses sent before deadline)
- Deadline passes within the inner loop

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52609 from vicennial/fixWait.

Authored-by: vicennial <venkata.gudesa@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit ff0f1ab)
Signed-off-by: Herman van Hovell <herman@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants