Summary
When a virtual child context (a FLAT-mode map or parallel branch) fails, the Python SDK writes a FAIL checkpoint for the branch. The SDK does not write corresponding START or SUCCEED checkpoints for virtual branches.
Expected behavior
A virtual branch should emit no lifecycle checkpoints at all, regardless of outcome. This matches the JS reference implementation, which suppresses START, SUCCEED, and FAIL writes uniformly when isVirtual === true:
aws/aws-durable-execution-sdk-js, src/handlers/run-in-child-context-handler/run-in-child-context-handler.ts:
- START gated on !isVirtual at line 287
- SUCCEED gated on !isVirtual at line 362
- FAIL gated on !isVirtual at line 401
Link: https://github.com/aws/aws-durable-execution-sdk-js/blob/main/packages/aws-durable-execution-sdk-js/src/handlers/run-in-child-context-handler/run-in-child-context-handler.ts#L401
Actual behavior (current Python, main)
src/aws_durable_execution_sdk_python/operation/child.py, ChildOperationExecutor.execute: the exception handler writes a create_context_fail update unconditionally. The self.config.is_virtual check that guards START (in check_result_status) and SUCCEED (in the success path of execute) is not applied to the FAIL write. When a FLAT branch fails, the Python SDK sends a CONTEXT/FAIL OperationUpdate for an operation id that was never started.
Reproduction
Run a map or parallel with nesting_type=NestingType.FLAT where at least one branch raises an exception, then inspect the execution history: a FAILED CONTEXT operation appears for each failed branch, with startedTimestamp == closedTimestamp. In the DurableExecutionsWorkerService metrics, numBillableOperations will have incremented for each of these phantom entries.
Fix
Gate the FAIL checkpoint on the same is_virtual condition that guards START and SUCCEED, so all three lifecycle entries are uniformly suppressed for virtual contexts. The exception must still propagate to the concurrency executor so it can record the branch's failure in the BatchResult and apply its completion-tolerance logic.
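The gating described above could look roughly like the following. This is a minimal, self-contained sketch and not the SDK's actual code: ChildConfig, RecordingCheckpointer, and run_child are hypothetical stand-ins for the config object, the checkpoint writer, and ChildOperationExecutor.execute. Only the is_virtual flag and the three lifecycle actions come from this report.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class ChildConfig:
    # Hypothetical stand-in for the child config; only is_virtual matters here.
    is_virtual: bool = False


@dataclass
class RecordingCheckpointer:
    # Hypothetical in-memory recorder standing in for the real checkpoint writer.
    updates: List[Tuple[str, str]] = field(default_factory=list)

    def write(self, operation_id: str, action: str) -> None:
        self.updates.append((operation_id, action))


def run_child(
    operation_id: str,
    func: Callable[[], T],
    config: ChildConfig,
    checkpointer: RecordingCheckpointer,
) -> T:
    """Run a child context, suppressing all lifecycle checkpoints when virtual."""
    if not config.is_virtual:
        checkpointer.write(operation_id, "START")
    try:
        result = func()
    except Exception:
        # The proposed fix: apply the same is_virtual gate to FAIL.
        if not config.is_virtual:
            checkpointer.write(operation_id, "FAIL")
        # Always re-raise so the caller (the concurrency executor in the
        # real SDK) can still record the branch failure.
        raise
    if not config.is_virtual:
        checkpointer.write(operation_id, "SUCCEED")
    return result
```

With this shape, a failing virtual branch leaves the recorder empty while the exception still reaches the caller, and a failing non-virtual branch still records START followed by FAIL.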