Skip to content

Virtual child contexts emit FAIL checkpoints in FLAT mode #362

@yaythomas

Description

@yaythomas

Summary

When a virtual child context (a FLAT-mode map or parallel branch) fails, the Python SDK writes a FAIL checkpoint for the branch. The SDK does not write corresponding START or SUCCEED checkpoints for virtual branches.

Expected behavior

A virtual branch should emit no lifecycle checkpoints at all, regardless of outcome. This matches the JS reference implementation, which suppresses START, SUCCEED, and FAIL writes uniformly when isVirtual === true:

aws/aws-durable-execution-sdk-js, src/handlers/run-in-child-context-handler/run-in-child-context-handler.ts:

  • START gated on !isVirtual at line 287
  • SUCCEED gated on !isVirtual at line 362
  • FAIL gated on !isVirtual at line 401

Link: https://github.com/aws/aws-durable-execution-sdk-js/blob/main/packages/aws-durable-execution-sdk-js/src/handlers/run-in-child-context-handler/run-in-child-context-handler.ts#L401

Actual behavior (current Python, main)

src/aws_durable_execution_sdk_python/operation/child.py, ChildOperationExecutor.execute: the exception handler writes a create_context_fail update unconditionally. The self.config.is_virtual check that guards START (in check_result_status) and SUCCEED (in the success path of execute) is not applied to the FAIL write. When a FLAT branch fails, the Python SDK sends a CONTEXT/FAIL OperationUpdate for an operation id that was never started.

Reproduction

Run a map or parallel with nesting_type=NestingType.FLAT where at least one branch raises an exception. Inspect execution history — a FAILED CONTEXT operation appears for each failed branch, with startedTimestamp == closedTimestamp. In the DurableExecutionsWorkerService metrics, numBillableOperations will have incremented for each of these phantom entries.

Fix

Gate the FAIL checkpoint on the same is_virtual condition that guards START and SUCCEED, so all three lifecycle entries are uniformly suppressed for virtual contexts. The exception must still propagate to the concurrency executor so it can record the branch's failure in the BatchResult and apply its completion-tolerance logic.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions