- 
                Notifications
    You must be signed in to change notification settings 
- Fork 86
[PLUGIN-1808] Retry policy to service account for 5xx errors for bigquery plugin #1544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PLUGIN-1808] Retry policy to service account for 5xx errors for bigquery plugin #1544
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a second thought, I feel just move all the retry configs & retry policy to ServiceAccountAccessTokenProvider class.
For start, we can use default configs and no need to make them configurable.
This will make the PR look cleaner and changes will be limited to one class.
|  | ||
| private BigQuerySourceConfig(@Nullable BigQueryConnectorConfig connection, @Nullable String dataset, | ||
| @Nullable String cmekKey, @Nullable String bucket, @Nullable String table) { | ||
| @Nullable String cmekKey, @Nullable String bucket, @Nullable String table) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation fix
7eb0bb3    to
    46c5d1e      
    Compare
  
    | } | ||
| return new AccessToken(token.getTokenValue(), token.getExpirationTime().getTime()); | ||
| }); | ||
| } catch (Exception e) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract cause from FailesafeException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| new ErrorCategory(ErrorCategoryEnum.PLUGIN), | ||
| "Unable to get service account access token after retries.", | ||
| e.getMessage(), | ||
| ErrorType.UNKNOWN, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for server error use SYSTEM error type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| private Configuration conf; | ||
| private GoogleCredentials credentials; | ||
| private static final Gson GSON = new Gson(); | ||
| private static final Logger logger = LoggerFactory.getLogger(ServiceAccountAccessTokenProvider.class); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually use private static final variables in uppercase: logger -> LOG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
        
          
                src/main/java/io/cdap/plugin/gcp/gcs/ServiceAccountAccessTokenProvider.java
          
            Show resolved
            Hide resolved
        
      | int initialRetryDuration = DEFAULT_INITIAL_RETRY_DURATION_SECONDS; | ||
| int maxRetryCount = DEFAULT_MAX_RETRY_COUNT; | ||
| int maxRetryDuration = DEFAULT_MAX_RETRY_DURATION_SECONDS; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we creating copy of instance variables here? It eliminates the purpose of them being final & static
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| private RetryPolicy<Object> getRetryPolicy(int initialRetryDuration, int maxRetryDuration, | ||
| int maxRetryCount) { | ||
| return RetryPolicy.builder() | ||
| .handle(ServerErrorException.class) | ||
| .withBackoff(Duration.ofSeconds(initialRetryDuration), Duration.ofSeconds(maxRetryDuration)) | ||
| .withMaxRetries(maxRetryCount) | ||
| .onRetry(event -> LOG.debug("Retry attempt {} due to {}", event.getAttemptCount(), event.getLastException(). | ||
| getMessage())) | ||
| .onSuccess(event -> LOG.debug("Access Token Fetched Successfully.")) | ||
| .onRetriesExceeded(event -> LOG.error("Retry limit reached for Service account.")) | ||
| .build(); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a private static method & retry policy can have a static final instance variable rather than creating retry policy for every call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
d52b351    to
    1011aff      
    Compare
  
    | new ErrorCategory(ErrorCategoryEnum.PLUGIN), | ||
| "Unable to get service account access token after retries.", | ||
| t.getMessage(), | ||
| ErrorType.SYSTEM, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getCause() should match first check whether it was a ServerException then only use SYSTEM otherwise UNKNOWN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| throw ErrorUtils.getProgramFailureException(new ErrorCategory(ErrorCategoryEnum.PLUGIN), | ||
| "Unable to refresh service account access token.", e.getMessage(), | ||
| ErrorType.UNKNOWN, true, e); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to use
google-cloud/src/main/java/io/cdap/plugin/gcp/common/GCPErrorDetailsProviderUtil.java
Line 117 in c13bece
| public static ProgramFailureException getHttpResponseExceptionDetailsFromChain(Throwable e, String errorReason, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
| }); | ||
| } catch (FailsafeException e) { | ||
| Throwable t = e.getCause() != null ? e.getCause() : e; | ||
| throw ErrorUtils.getProgramFailureException( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to use
google-cloud/src/main/java/io/cdap/plugin/gcp/common/GCPErrorDetailsProviderUtil.java
Line 117 in c13bece
| public static ProgramFailureException getHttpResponseExceptionDetailsFromChain(Throwable e, String errorReason, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
1011aff    to
    4bf53f8      
    Compare
  
    | public static final int BQ_DEFAULT_READ_TIMEOUT_SECONDS = 120; | ||
| public static final String DATASTORE_SUPPORTED_DOC_URL = "https://cloud.google.com/datastore/docs/concepts/errors"; | ||
| public static final String BIG_TABLE_SUPPORTED_DOC_URL = "https://cloud.google.com/bigtable/docs/status-codes"; | ||
| public static final String SERVER_ERROR_SUPPORTED_DOC_URL = | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCE_METADATA_SERVER_ERROR_SUPPORTED_DOC_URL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
| return msg != null && msg.matches("^5\\d{2}$"); // crude check for 5xx codes | ||
| } | ||
|  | ||
| private com.google.auth.oauth2.AccessToken safeGetAccessToken() throws IOException { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do we mean by safeGetAccessToken here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually method names should indicate what are we trying to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to retrieveAccessToken
| LOG.debug( | ||
| "Initializing RetryPolicy with the following configuration: MaxRetryCount: {}, InitialRetryDuration: {}s, " + | ||
| "MaxRetryDuration: {}s", DEFAULT_MAX_RETRY_COUNT, DEFAULT_INITIAL_RETRY_DURATION_SECONDS, | ||
| DEFAULT_MAX_RETRY_DURATION_SECONDS); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need of this log for every access token fetch.
This can be present in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| @Override | ||
| public void refresh() throws IOException { | ||
| try { | ||
| getCredentials().refresh(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add retries on getCredentials().refresh()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|  | ||
| private boolean isServerError(IOException e) { | ||
| String msg = e.getMessage(); | ||
| return msg != null && msg.matches("^5\\d{2}$"); // crude check for 5xx codes | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should broaden the matching regex to match something like :
Unexpected Error code 5xx trying to get security access token from Compute Engine metadata for the default service account
And we could also keep the compiled pattern as prival final static variable instead of compiling it for every call.
Regex matches are cheap but pattern compilation is costly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
| public static final int DEFAULT_MAX_RETRY_COUNT = 5; | ||
| public static final int DEFAULT_MAX_RETRY_DURATION_SECONDS = 80; | ||
| private static final RetryPolicy<Object> RETRY_POLICY = createRetryPolicy(); | ||
| private static final Pattern SERVER_ERROR_PATTERN = Pattern.compile(".*5\\d{2}.*"); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still not addressed : #1544 (comment)
We are still looking at very broad pattern of just finding 5xx in the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern is broad as we are not sure of the error message and are only matching on 5xx.
Is there a way get the error message structure to narrow this regex match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already know from where the error message is coming. Please look at the stacktrace of errors in the JIRA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added done
        
          
                src/main/java/io/cdap/plugin/gcp/gcs/ServiceAccountAccessTokenProvider.java
          
            Show resolved
            Hide resolved
        
              
          
                src/main/java/io/cdap/plugin/gcp/gcs/ServiceAccountAccessTokenProvider.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      39dab19    to
    2cd1c91      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a unit test?
        
          
                src/main/java/io/cdap/plugin/gcp/gcs/ServiceAccountAccessTokenProvider.java
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also add unit tests to check retries, for example : https://github.com/cdapio/cdap/pull/15226/files#diff-acc8b9b649f280ee45108ce0220bc742b327acd041f1c1022489452f021ad852R53-R69
dea8c9c    to
    acd0f0d      
    Compare
  
    
PLUGIN-1808 Added a retry policy to handle 5xx server errors when fetching the service account access token in the BigQuery plugin