[SPARK-52673][CONNECT][CLIENT] Add grpc RetryInfo handling to Spark Connect retry policies #51363

khakhlyuk · 2025-07-03T10:56:54Z

What changes were proposed in this pull request?

The Spark Connect client has a set of retry policies that specify which errors coming from the server can be retried.
This change lets the Spark Connect client use server-provided retry information, following the gRPC standard: https://github.com/googleapis/googleapis/blob/master/google/rpc/error_details.proto#L91
The server can include a RetryInfo gRPC message containing a retry_delay field in its error response. The client now uses the RetryInfo message to classify the error as retryable and uses retry_delay to calculate how long to wait before the next attempt. This behavior is in line with the gRPC standard for client-server communication.
The change is needed for two reasons:

1) If the server is under heavy load or a task takes more time, it can tell the client to wait longer using the retry_delay field.
2) If the server needs to introduce a new retryable error, it can simply include RetryInfo in the error response. The client will then retry the error automatically. No changes to the client-side retry policies are needed to retry the new error.
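
To make the two points above concrete, here is a minimal sketch of how a client could classify an error and read the delay. It uses plain dataclasses as stand-ins for the real protobuf messages so that it runs without gRPC installed. Only RetryInfo, retry_delay, and the seconds/nanos layout of a protobuf Duration come from the gRPC definitions; ServerError, is_retryable, and retry_delay_ms are hypothetical names for illustration, not the client's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Duration:
    """Stand-in for google.protobuf.Duration (seconds plus nanos)."""
    seconds: int = 0
    nanos: int = 0


@dataclass(frozen=True)
class RetryInfo:
    """Stand-in for google.rpc.RetryInfo."""
    retry_delay: Optional[Duration] = None


@dataclass(frozen=True)
class ServerError:
    """Hypothetical error wrapper; the real client inspects the gRPC status details."""
    message: str
    retry_info: Optional[RetryInfo] = None


def is_retryable(error: ServerError) -> bool:
    # Assumption: the mere presence of RetryInfo marks the error as retryable.
    return error.retry_info is not None


def retry_delay_ms(error: ServerError) -> Optional[int]:
    # Convert the Duration to whole milliseconds, or None if no delay was sent.
    if error.retry_info is None or error.retry_info.retry_delay is None:
        return None
    delay = error.retry_info.retry_delay
    return delay.seconds * 1000 + delay.nanos // 1_000_000
```

With this shape, a server that wants a brand-new error to be retried only has to attach RetryInfo to it; the client-side check does not need to know the error type.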

Changes in detail

- Adding new recognize_server_retry_delay and max_server_retry_delay options for RetryPolicy classes in the Python and Scala clients.
- All policies with recognize_server_retry_delay=True will take RetryInfo.retry_delay into account when calculating the next backoff.
- retry_delay can override the client's max_backoff.
- retry_delay is limited by max_server_retry_delay (10 minutes by default).
- When the server stops sending high retry_delay values, the client goes back to using its own backoff policy, limited by max_backoff.
- DefaultPolicy has recognize_server_retry_delay=True and will use retry_delay in the backoff calculation.
- Additionally, DefaultPolicy will classify all errors with RetryInfo as retryable.
- If an error can be retried by several policies, retry it only with the first one (highest priority) and then stop. This change is needed because DefaultPolicy now retries all errors with RetryInfo. If we kept the existing behaviour, an error that both has RetryInfo and is matched by a different CustomPolicy would be retried by both the DefaultPolicy and the CustomPolicy. This can lead to excessively long retry periods and complicates the planning of total retry times.
- Moving retry-policy-related tests from test_client.py to a new test_client_retries.py file. Same for Scala.
- Extending docstrings.
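
The backoff rules in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: the option names recognize_server_retry_delay, max_server_retry_delay, and max_backoff come from the description, while BackoffSketch, next_wait_ms, initial_backoff, backoff_multiplier, the concrete default values other than the 10-minute cap, and the choice to combine the two waits with max() are hypothetical. Jitter is left out to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackoffSketch:
    # All durations are in milliseconds. Values other than the 10-minute cap are assumed.
    initial_backoff: int = 50
    backoff_multiplier: float = 4.0
    max_backoff: int = 60_000
    recognize_server_retry_delay: bool = True
    max_server_retry_delay: int = 10 * 60 * 1000  # 10 minutes, per the description

    def next_wait_ms(self, attempt: int, server_retry_delay_ms: Optional[int]) -> int:
        # The client's own exponential backoff, always limited by max_backoff.
        client_wait = min(
            int(self.initial_backoff * self.backoff_multiplier ** attempt),
            self.max_backoff,
        )
        if not self.recognize_server_retry_delay or server_retry_delay_ms is None:
            return client_wait
        # The server hint is limited by max_server_retry_delay, not by max_backoff,
        # so it may exceed max_backoff.
        server_wait = min(server_retry_delay_ms, self.max_server_retry_delay)
        # Assumption: wait for whichever of the two is longer.
        return max(client_wait, server_wait)
```

For example, with these assumed defaults a server hint of two minutes wins over the one-minute max_backoff, a hint of one hour is cut down to ten minutes, and once the server stops sending hints the wait falls back to the client's own capped backoff.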

Why are the changes needed?

See above

Does this PR introduce any user-facing change?

1. The clients retry all errors that carry a RetryInfo gRPC message using the DefaultPolicy.
2. An error is only retried by the first policy that matches it.
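
A minimal sketch of the second point, assuming policies are kept in priority order and each exposes a predicate. The names SketchError, NamedPolicy, and first_matching_policy are hypothetical and only illustrate the selection rule; they are not the client's classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class SketchError(Exception):
    """Hypothetical error type that records whether the server attached RetryInfo."""

    def __init__(self, message: str, has_retry_info: bool = False):
        super().__init__(message)
        self.has_retry_info = has_retry_info


@dataclass(frozen=True)
class NamedPolicy:
    name: str
    can_retry: Callable[[SketchError], bool]


def first_matching_policy(
    policies: Sequence[NamedPolicy], error: SketchError
) -> Optional[NamedPolicy]:
    # Policies are assumed to be ordered by priority; only the first match handles the error.
    for policy in policies:
        if policy.can_retry(error):
            return policy
    return None


# A custom policy that matches on the message, and a default that matches on RetryInfo.
custom = NamedPolicy("CustomPolicy", lambda e: "UNAVAILABLE" in str(e))
default = NamedPolicy("DefaultPolicy", lambda e: e.has_retry_info)
```

An error that both policies match is handed to whichever comes first in the list, so its total retry time is bounded by a single policy's budget rather than the sum of both.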

How was this patch tested?

Old and new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

HyukjinKwon · 2025-07-14T08:58:43Z

python/pyspark/sql/tests/connect/client/test_client_retries.py

+
+import unittest
+
+import google.protobuf.any_pb2 as any_pb2


Can we import this under if should_test_connect?
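
For readers unfamiliar with the pattern being requested, a guarded import might look roughly like the sketch below. In PySpark the flag is imported from the testing utilities; here should_test_connect is computed by a hypothetical stand-in helper (_module_available) so the snippet runs whether or not protobuf is installed. The test class is likewise only an illustration.

```python
import importlib.util
import unittest


def _module_available(name: str) -> bool:
    # find_spec raises ModuleNotFoundError when a parent package is missing.
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False


# Stand-in for the real flag, which PySpark's test utilities provide.
should_test_connect = _module_available("google.protobuf")

if should_test_connect:
    # Only import optional dependencies when the Connect tests can actually run.
    import google.protobuf.any_pb2 as any_pb2


@unittest.skipIf(not should_test_connect, "Spark Connect dependencies are not installed")
class GuardedImportExample(unittest.TestCase):
    def test_any_message_is_constructible(self):
        # A freshly constructed Any message has an empty type_url.
        self.assertEqual(any_pb2.Any().type_url, "")
```

The point of the guard is that merely importing the test module should not fail in environments where the optional Connect dependencies are absent; the tests are skipped instead.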

HyukjinKwon · 2025-07-14T09:04:33Z

Merged to master.

Closes apache#51363 from khakhlyuk/retryinfo.

Authored-by: Alex Khakhlyuk <alex.khakhlyuk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added SQL PYTHON CONNECT labels Jul 3, 2025

khakhlyuk added 2 commits July 3, 2025 17:16

python

c979b5f

scala

d5541b5

khakhlyuk force-pushed the retryinfo branch from f6d5fa7 to d5541b5 Compare July 3, 2025 15:19

khakhlyuk added 2 commits July 3, 2025 17:20

format

e98010c

lint and test fix

fd99152

HyukjinKwon reviewed Jul 14, 2025

HyukjinKwon approved these changes Jul 14, 2025

move imports

f4051f2

HyukjinKwon closed this in 59303f7 Jul 14, 2025
