Skip to content

fix: Classify non-retryable customer errors for Durable API calls#358

Open
denzyll wants to merge 5 commits intomainfrom
feat/handle-non-retryable-kms-exceptions
Open

fix: Classify non-retryable customer errors for Durable API calls#358
denzyll wants to merge 5 commits intomainfrom
feat/handle-non-retryable-kms-exceptions

Conversation

@denzyll
Copy link
Copy Markdown

@denzyll denzyll commented Apr 28, 2026

Issue #, if available:

N/A

Description of changes:

Non-retryable customer errors from Lambda (e.g., KMSAccessDeniedException, KMSDisabledException) arrive as HTTP 502 during CheckpointDurableExecution and GetDurableExecutionState API calls. Without this change, these 502s are classified as retryable invocation errors, causing the SDK to retry invocations that will never self-resolve.

Extract shared classification logic into BotoClientError with a _NON_RETRYABLE_CUSTOMER_ERROR_CODES set and _classify_error_category method. Both CheckpointError and GetExecutionStateError now inherit from_exception() with unified classification so that non-retryable errors (KMS 502s, 4xx client errors) return Status: FAILED immediately, while retryable errors (5xx, 429, network errors) continue to raise for Lambda retry.

Add is_retryable() to InvocationError hierarchy so execution.py handlers use a single interface instead of isinstance checks.

Add backward-compatible aliases for CheckpointErrorCategory and is_retriable().

Testing — parameterized classification tests across both error types for all four KMS codes, 4xx, 429, 5xx, and 502 scenarios. Integration tests for non-retryable/retryable errors across all three execution.py code paths: initial pagination, background thread, and user thread.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread src/aws_durable_execution_sdk_python/exceptions.py
Comment thread src/aws_durable_execution_sdk_python/exceptions.py
invocation_input.checkpoint_token,
invocation_input.initial_execution_state.next_marker,
)
except BotoClientError as e:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this error handling logic be in the method fetch_paginated_operations? fetch_paginated_operations is called in two places

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All decisions to fail the execution or retry the invocation are made in the execution file.

result=serialized_result
).to_dict()

except BackgroundThreadError as bg_error:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why do we wrap checkpoint error with this error type BackgroundThreadError?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use it to make sure the error is propagated from the background checkpointing thread to the handler thread in execution.py for error handling.

# Background checkpointing errors will propagate through CompletionEvent.wait() as BackgroundThreadError

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the BackgroundThreadError docstring has more details:

It says that BackgroundThreadError inherits from BaseException to bypass normal exception handlers.

Comment thread src/aws_durable_execution_sdk_python/execution.py
denzyll added 5 commits April 28, 2026 20:12
Non-retriable customer errors from Lambda (e.g., KMSAccessDeniedException,
KMSDisabledException) arrive as HTTP 502 during CheckpointDurableExecution
and GetDurableExecutionState API calls. Without this change, these 502s
are classified as retriable invocation errors, causing the SDK to retry
invocations that will never self-resolve.

Extract shared classification logic into BotoClientError with a
_NON_RETRIABLE_CUSTOMER_ERROR_CODES set and _classify_error_category
method. Both CheckpointError and GetExecutionStateError now inherit
from_exception() with unified classification so that non-retriable
errors (KMS 502s, 4xx client errors) return Status: FAILED immediately,
while retriable errors (5xx, 429, network errors) continue to raise
for Lambda retry.

Add is_retriable() to InvocationError hierarchy so execution.py
handlers use a single interface instead of isinstance checks.

Testing — parameterized classification tests across both error types
for all four KMS codes, 4xx, 429, 5xx, and 502 scenarios. Integration
tests for non-retriable/retriable errors across all three execution.py
code paths: initial pagination, background thread, and user thread.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
@denzyll denzyll force-pushed the feat/handle-non-retryable-kms-exceptions branch from dce0b9d to ad94b08 Compare April 29, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants