fix: Classify non-retryable customer errors for Durable API calls#358
fix: Classify non-retryable customer errors for Durable API calls#358
Conversation
| invocation_input.checkpoint_token, | ||
| invocation_input.initial_execution_state.next_marker, | ||
| ) | ||
| except BotoClientError as e: |
There was a problem hiding this comment.
shouldn't this error handling logic be in the method fetch_paginated_operations? fetch_paginated_operations is called in two places
There was a problem hiding this comment.
All decisions to fail the execution or retry the invocation are made in the execution file.
| result=serialized_result | ||
| ).to_dict() | ||
|
|
||
| except BackgroundThreadError as bg_error: |
There was a problem hiding this comment.
Just curious, why do we wrap checkpoint error with this error type BackgroundThreadError?
There was a problem hiding this comment.
We use it to make sure the error is propagated from the background checkpointing thread to the handler thread in execution.py for error handling.
# Background checkpointing errors will propagate through CompletionEvent.wait() as BackgroundThreadError
There was a problem hiding this comment.
Also, the BackgroundThreadError docstring has more details:
It says that BackgroundThreadError inherits from BaseException to bypass normal exception handlers.
Non-retriable customer errors from Lambda (e.g., KMSAccessDeniedException, KMSDisabledException) arrive as HTTP 502 during CheckpointDurableExecution and GetDurableExecutionState API calls. Without this change, these 502s are classified as retriable invocation errors, causing the SDK to retry invocations that will never self-resolve. Extract shared classification logic into BotoClientError with a _NON_RETRIABLE_CUSTOMER_ERROR_CODES set and _classify_error_category method. Both CheckpointError and GetExecutionStateError now inherit from_exception() with unified classification so that non-retriable errors (KMS 502s, 4xx client errors) return Status: FAILED immediately, while retriable errors (5xx, 429, network errors) continue to raise for Lambda retry. Add is_retriable() to InvocationError hierarchy so execution.py handlers use a single interface instead of isinstance checks. Testing — parameterized classification tests across both error types for all four KMS codes, 4xx, 429, 5xx, and 502 scenarios. Integration tests for non-retriable/retriable errors across all three execution.py code paths: initial pagination, background thread, and user thread. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
dce0b9d to
ad94b08
Compare
Issue #, if available:
N/A
Description of changes:
Non-retryable customer errors from Lambda (e.g., KMSAccessDeniedException, KMSDisabledException) arrive as HTTP 502 during CheckpointDurableExecution and GetDurableExecutionState API calls. Without this change, these 502s are classified as retryable invocation errors, causing the SDK to retry invocations that will never self-resolve.
Extract shared classification logic into BotoClientError with a _NON_RETRYABLE_CUSTOMER_ERROR_CODES set and _classify_error_category method. Both CheckpointError and GetExecutionStateError now inherit from_exception() with unified classification so that non-retryable errors (KMS 502s, 4xx client errors) return Status: FAILED immediately, while retryable errors (5xx, 429, network errors) continue to raise for Lambda retry.
Add is_retryable() to InvocationError hierarchy so execution.py handlers use a single interface instead of isinstance checks.
Add backward-compatible aliases for CheckpointErrorCategory and is_retriable().
Testing — parameterized classification tests across both error types for all four KMS codes, 4xx, 429, 5xx, and 502 scenarios. Integration tests for non-retryable/retryable errors across all three execution.py code paths: initial pagination, background thread, and user thread.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.