Skip to content

Conversation

@AnkitCLI
Copy link
Contributor

@AnkitCLI AnkitCLI commented May 6, 2025

PLUGIN-1808 Added a retry policy to handle 5xx server errors when fetching the service account access token in the BigQuery plugin

@AnkitCLI AnkitCLI changed the title Retry policy for service account Retry policy to service account for 5xx errros May 6, 2025
@AnkitCLI AnkitCLI changed the title Retry policy to service account for 5xx errros Retry policy to service account for 5xx errors May 6, 2025
@AnkitCLI AnkitCLI marked this pull request as ready for review May 6, 2025 07:05
@AnkitCLI AnkitCLI added the build Trigger unit test build label May 6, 2025
@AnkitCLI AnkitCLI changed the title Retry policy to service account for 5xx errors Retry policy to service account for 5xx errors for bigquery plugin May 6, 2025
@itsankit-google itsankit-google changed the title Retry policy to service account for 5xx errors for bigquery plugin [PLUGIN-1808] Retry policy to service account for 5xx errors for bigquery plugin Jun 2, 2025
Copy link
Member

@itsankit-google itsankit-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second thought, I feel just move all the retry configs & retry policy to ServiceAccountAccessTokenProvider class.

For start, we can use default configs and no need to make them configurable.

This will make the PR look cleaner and changes will be limited to one class.


private BigQuerySourceConfig(@Nullable BigQueryConnectorConfig connection, @Nullable String dataset,
@Nullable String cmekKey, @Nullable String bucket, @Nullable String table) {
@Nullable String cmekKey, @Nullable String bucket, @Nullable String table) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation fix

@AnkitCLI AnkitCLI removed the build Trigger unit test build label Jun 3, 2025
@AnkitCLI AnkitCLI force-pushed the AddingRetryWithFailsafe branch 3 times, most recently from 7eb0bb3 to 46c5d1e Compare June 3, 2025 07:08
@AnkitCLI AnkitCLI added the build Trigger unit test build label Jun 3, 2025
}
return new AccessToken(token.getTokenValue(), token.getExpirationTime().getTime());
});
} catch (Exception e) {
Copy link
Member

@itsankit-google itsankit-google Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract cause from FailesafeException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

new ErrorCategory(ErrorCategoryEnum.PLUGIN),
"Unable to get service account access token after retries.",
e.getMessage(),
ErrorType.UNKNOWN,
Copy link
Member

@itsankit-google itsankit-google Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for server error use SYSTEM error type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private Configuration conf;
private GoogleCredentials credentials;
private static final Gson GSON = new Gson();
private static final Logger logger = LoggerFactory.getLogger(ServiceAccountAccessTokenProvider.class);
Copy link
Member

@itsankit-google itsankit-google Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually use private static final variables in uppercase: logger -> LOG

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 62 to 64
int initialRetryDuration = DEFAULT_INITIAL_RETRY_DURATION_SECONDS;
int maxRetryCount = DEFAULT_MAX_RETRY_COUNT;
int maxRetryDuration = DEFAULT_MAX_RETRY_DURATION_SECONDS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we creating copy of instance variables here? It eliminates the purpose of them being final & static

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 92 to 96
private RetryPolicy<Object> getRetryPolicy(int initialRetryDuration, int maxRetryDuration,
int maxRetryCount) {
return RetryPolicy.builder()
.handle(ServerErrorException.class)
.withBackoff(Duration.ofSeconds(initialRetryDuration), Duration.ofSeconds(maxRetryDuration))
.withMaxRetries(maxRetryCount)
.onRetry(event -> LOG.debug("Retry attempt {} due to {}", event.getAttemptCount(), event.getLastException().
getMessage()))
.onSuccess(event -> LOG.debug("Access Token Fetched Successfully."))
.onRetriesExceeded(event -> LOG.error("Retry limit reached for Service account."))
.build();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a private static method & retry policy can have a static final instance variable rather than creating retry policy for every call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@AnkitCLI AnkitCLI force-pushed the AddingRetryWithFailsafe branch 2 times, most recently from d52b351 to 1011aff Compare June 5, 2025 06:03
new ErrorCategory(ErrorCategoryEnum.PLUGIN),
"Unable to get service account access token after retries.",
t.getMessage(),
ErrorType.SYSTEM,
Copy link
Member

@itsankit-google itsankit-google Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCause() should match first check whether it was a ServerException then only use SYSTEM otherwise UNKNOWN.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 127 to 129
throw ErrorUtils.getProgramFailureException(new ErrorCategory(ErrorCategoryEnum.PLUGIN),
"Unable to refresh service account access token.", e.getMessage(),
ErrorType.UNKNOWN, true, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use

public static ProgramFailureException getHttpResponseExceptionDetailsFromChain(Throwable e, String errorReason,
with externalDocUrl as https://cloud.google.com/compute/docs/troubleshooting/troubleshoot-metadata-server

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

});
} catch (FailsafeException e) {
Throwable t = e.getCause() != null ? e.getCause() : e;
throw ErrorUtils.getProgramFailureException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use

public static ProgramFailureException getHttpResponseExceptionDetailsFromChain(Throwable e, String errorReason,
with externalDocUrl as https://cloud.google.com/compute/docs/troubleshooting/troubleshoot-metadata-server

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@AnkitCLI AnkitCLI force-pushed the AddingRetryWithFailsafe branch from 1011aff to 4bf53f8 Compare June 5, 2025 07:25
public static final int BQ_DEFAULT_READ_TIMEOUT_SECONDS = 120;
public static final String DATASTORE_SUPPORTED_DOC_URL = "https://cloud.google.com/datastore/docs/concepts/errors";
public static final String BIG_TABLE_SUPPORTED_DOC_URL = "https://cloud.google.com/bigtable/docs/status-codes";
public static final String SERVER_ERROR_SUPPORTED_DOC_URL =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCE_METADATA_SERVER_ERROR_SUPPORTED_DOC_URL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

return msg != null && msg.matches("^5\\d{2}$"); // crude check for 5xx codes
}

private com.google.auth.oauth2.AccessToken safeGetAccessToken() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we mean by safeGetAccessToken here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually method names should indicate what are we trying to do here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to retrieveAccessToken

Comment on lines 63 to 66
LOG.debug(
"Initializing RetryPolicy with the following configuration: MaxRetryCount: {}, InitialRetryDuration: {}s, " +
"MaxRetryDuration: {}s", DEFAULT_MAX_RETRY_COUNT, DEFAULT_INITIAL_RETRY_DURATION_SECONDS,
DEFAULT_MAX_RETRY_DURATION_SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need of this log for every access token fetch.

This can be present in the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@Override
public void refresh() throws IOException {
try {
getCredentials().refresh();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add retries on getCredentials().refresh()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added


private boolean isServerError(IOException e) {
String msg = e.getMessage();
return msg != null && msg.matches("^5\\d{2}$"); // crude check for 5xx codes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should broaden the matching regex to match something like :

Unexpected Error code 5xx trying to get security access token from Compute Engine metadata for the default service account

And we could also keep the compiled pattern as prival final static variable instead of compiling it for every call.

Regex matches are cheap but pattern compilation is costly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@AnkitCLI AnkitCLI requested a review from itsankit-google June 6, 2025 09:44
public static final int DEFAULT_MAX_RETRY_COUNT = 5;
public static final int DEFAULT_MAX_RETRY_DURATION_SECONDS = 80;
private static final RetryPolicy<Object> RETRY_POLICY = createRetryPolicy();
private static final Pattern SERVER_ERROR_PATTERN = Pattern.compile(".*5\\d{2}.*");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still not addressed : #1544 (comment)

We are still looking at very broad pattern of just finding 5xx in the error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern is broad as we are not sure of the error message and are only matching on 5xx.
Is there a way get the error message structure to narrow this regex match.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see this class : https://github.com/googleapis/google-auth-library-java/blob/d02ab85ae8d29402650d3cbf00e00363a903fe31/oauth2_http/java/com/google/auth/oauth2/ComputeEngineCredentials.java#L353-L354

We already know from where the error message is coming. Please look at the stacktrace of errors in the JIRA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added done

@AnkitCLI AnkitCLI force-pushed the AddingRetryWithFailsafe branch 2 times, most recently from 39dab19 to 2cd1c91 Compare June 16, 2025 06:20
@AnkitCLI AnkitCLI added build Trigger unit test build and removed build Trigger unit test build labels Jun 16, 2025
@AnkitCLI AnkitCLI requested a review from itsankit-google June 23, 2025 04:54
@AnkitCLI AnkitCLI requested review from itsankit-google and removed request for itsankit-google July 1, 2025 05:34
Copy link
Member

@itsankit-google itsankit-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a unit test?

Copy link
Member

@itsankit-google itsankit-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AnkitCLI AnkitCLI force-pushed the AddingRetryWithFailsafe branch from dea8c9c to acd0f0d Compare July 8, 2025 05:44
@AnkitCLI AnkitCLI merged commit e4aa851 into data-integrations:develop Jul 8, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Trigger unit test build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants