fix(bigquery): retry getTable server errors#13383
Conversation
Translate HTTP response exceptions thrown by getTable before retry evaluation so existing BigQueryException retryability handles transient server errors. Keep the change scoped to getTable and leave raw network IOExceptions on the existing retry path. Fixes googleapis#12885
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request wraps the call to getTableSkipExceptionTranslation in BigQueryImpl with a try-catch block to catch HttpResponseException and rethrow it as a BigQueryException. It also adds a corresponding unit test testGetTableFailureShouldRetryServerErrors to verify that 503 server errors are successfully retried. There are no review comments, so I have no feedback to provide.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request wraps the getTableSkipExceptionTranslation call in BigQueryImpl with a try-catch block to catch HttpResponseException and throw a BigQueryException, enabling transient server errors (such as HTTP 503) to be retried. It also adds a unit test to verify this retry behavior. The reviewer suggests applying this same try-catch wrapping to other *SkipExceptionTranslation methods (e.g., for models, routines, and datasets) to ensure consistent retry behavior across all resource retrieval APIs.
| } catch (HttpResponseException e) { | ||
| throw new BigQueryException(e); | ||
| } |
There was a problem hiding this comment.
While this fix correctly addresses the retry behavior for getTable by wrapping HttpResponseException in BigQueryException, the same pattern of using *SkipExceptionTranslation methods (such as getModelSkipExceptionTranslation, getRoutineSkipExceptionTranslation, and potentially getDatasetSkipExceptionTranslation) is used elsewhere in BigQueryImpl. These other methods likely suffer from the same issue where transient server errors (like HTTP 503) are not retried. Consider applying the same try-catch wrapping to those methods as well to ensure consistent retry behavior across all resource retrieval APIs.
There was a problem hiding this comment.
@lqiu96 As per the gemini's suggestion does below scope make sense?
the current change only covers getTable, but getDataset, getModel and getRoutine use the same pattern - the *SkipExceptionTranslation RPC call runs inside BigQueryRetryHelper.runWithRetries(...), so a raw HttpResponseException can avoid the normal BigQueryException translation before retry handling.
So my plan is to keep the follow-up scoped to those resource getter paths only:
getDatasetgetModelgetRoutine
I am not planning to wrap every *SkipExceptionTranslation call in this PR, since that would widen the behavior change beyond this retry issue.
I will add matching tests for the three paths with a first 503 response and a successful second response, similar to the new getTable coverage.
There was a problem hiding this comment.
Thanks, I think that makes sense. We can create the necessary follow up for the other relevant RPCs. For now, let's try this with getTable.
| try { | ||
| return bigQueryRpc.getTableSkipExceptionTranslation( | ||
| completeTableId.getProject(), | ||
| completeTableId.getDataset(), | ||
| completeTableId.getTable(), | ||
| optionsMap); | ||
| } catch (HttpResponseException e) { | ||
| throw new BigQueryException(e); | ||
| } |
There was a problem hiding this comment.
Thanks for this. IIUC the root of the issue is that the default configuration isn't retrying on a HttpResponseException due to how it's configured in BaseService.
I think this will work, but I have a small concern about behavioral compatibility now that we are throwing a new exception (it may be a small possibility, but I'll need to do a bit more digging into this). BigQuery is an old library and I think it is possible that users may have written a custom ResultRetryAlgorithm to handle these cases since we weren't natively supporting it (e.g. ResultRetryAlgorithm catch specifically for HttpResponseException and parse out the status code). I don't want to wrap the exception and have the behavior changing for existing users.
One idea I have in mind is something for max compatibility is like:
private ResultRetryAlgorithm<Object> getDefaultRetryAlgorithm() {
final ResultRetryAlgorithm<?> configuredAlgorithm = getOptions().getResultRetryAlgorithm();
// 1. If the user configured a custom algorithm, respect it completely (do not apply fallback)
if (configuredAlgorithm != BigQueryBaseService.DEFAULT_BIGQUERY_EXCEPTION_HANDLER) {
ResultRetryAlgorithm<Object> customAlgorithm = (ResultRetryAlgorithm<Object>) configuredAlgorithm;
return customAlgorithm;
}
// 2. If they are using the default handler, wrap it to add transient HTTP retries (like 503)
return new ResultRetryAlgorithm<Object>() {
@Override
public TimedAttemptSettings createNextAttempt(
Throwable previousThrowable, Object previousResponse, TimedAttemptSettings previousSettings) {
return null; // Delegate timing to TimedRetryAlgorithm
}
@Override
public boolean shouldRetry(Throwable previousThrowable, Object previousResponse) {
// Check default transient HTTP rules first
if (previousThrowable instanceof HttpResponseException) {
int statusCode = ((HttpResponseException) previousThrowable).getStatusCode();
if (statusCode == 500 || statusCode == 502 || statusCode == 503 || statusCode == 504) {
return true;
}
}
// Fall back to default ExceptionHandler rules (SocketException, ConnectException, etc.)
ResultRetryAlgorithm<Object> delegate = (ResultRetryAlgorithm<Object>) configuredAlgorithm;
return delegate.shouldRetry(previousThrowable, previousResponse);
}
};
}
And if this makes sense, we use this default for the relevant RPCs (getTable, ... , etc).
Translate HTTP response exceptions thrown by BigQuery.getTable before retry evaluation so the existing BigQueryException retryability rules can retry transient server errors such as HTTP 503.
This keeps the change scoped to getTable. It does not change job retry behavior or raw network IOException handling.
Fixes #12885
Tests:
mvn -pl java-bigquery/google-cloud-bigquery -Dtest=BigQueryImplTest#testGetTableFailureShouldRetryServerErrors -DskipITs test
mvn -pl java-bigquery/google-cloud-bigquery -Dtest=BigQueryImplTest -DskipITs test
mvn -pl java-bigquery/google-cloud-bigquery -DskipITs test
mvn -pl java-bigquery/google-cloud-bigquery -DskipITs package