Virtual Context bug on parent id and FAIL checkpoint emitting #364

Open
yaythomas wants to merge 2 commits into main from refactor/virtual-context

yaythomas commented Apr 29, 2026

Description of changes:

Fixes two functional bugs in virtual child contexts and refactors the supporting mechanism.

#363: run_in_child_context(is_virtual=True) sends an invalid parent_id to the backend. On main, operations inside a virtual run_in_child_context stamp the branch's own operation id as their parent_id. Because the branch writes no START/SUCCEED under is_virtual=True, that id does not exist in the execution history. The backend correctly rejects the checkpoint with InvalidParameterValueException(CHECKPOINT_INVALID_PARENT_OPERATION_ID), and the user's execution fails at the first inner checkpoint. Nesting virtual scopes (or placing FLAT map/parallel inside a virtual scope) compounds the failure. After this change, inner operations stamp the id of the enclosing non-virtual ancestor, which is a real checkpointed parent.
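To make the failure mode in #363 concrete, here is a minimal sketch of a history that rejects checkpoints whose parent was never announced. `ToyHistory`, `InvalidParentError`, `try_checkpoint`, and the ids are hypothetical stand-ins for illustration; they are not the backend's or the SDK's actual types.

```python
class InvalidParentError(Exception):
    """Toy stand-in for the backend's invalid-parent rejection (hypothetical)."""


class ToyHistory:
    """Accepts a checkpoint only if its parent_id was checkpointed earlier."""

    def __init__(self, root_id: str) -> None:
        self.known_ids = {root_id}

    def checkpoint(self, operation_id: str, parent_id: str) -> None:
        if parent_id not in self.known_ids:
            raise InvalidParentError(f"unknown parent_id {parent_id!r}")
        self.known_ids.add(operation_id)


def try_checkpoint(history: ToyHistory, operation_id: str, parent_id: str) -> bool:
    """Return True if the toy history accepts the checkpoint."""
    try:
        history.checkpoint(operation_id, parent_id)
        return True
    except InvalidParentError:
        return False


history = ToyHistory(root_id="ctx-1")
# The virtual branch "ctx-1-branch" writes no START, so its id is never known.

# Behaviour described for main: the step points at the unannounced branch id.
old_accepted = try_checkpoint(history, "step-1", parent_id="ctx-1-branch")
# Behaviour described after the change: the step points at the real ancestor.
new_accepted = try_checkpoint(history, "step-1", parent_id="ctx-1")

print(old_accepted, new_accepted)  # False True
```

The toy only models the "parent must already exist" rule; the real backend's validation may cover more than this.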

#362: virtual branches wrote a FAIL checkpoint on exceptions. On main, an exception inside a virtual branch caused ChildOperationExecutor.execute to unconditionally write a FAIL checkpoint for the branch, producing a phantom FAILED CONTEXT entry in the execution history with no matching START. On the FLAT map/parallel path this was cosmetic: the FAIL's parent id was the valid map/parallel op, and the backend accepts a CONTEXT FAIL without a prior START. On the run_in_child_context(is_virtual=True) path it compounded #363: the FAIL's parent id was the same orphan branch id, which the backend rejects. Either way it is incorrect, because virtual branches should emit no lifecycle entries at all. After this change, virtual branches emit no lifecycle checkpoints; failures still propagate to the concurrency executor's BatchResult (or to the caller of run_in_child_context), and completion-tolerance logic still applies.
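The intended behaviour for #362 can be sketched roughly as follows. `execute_child`, `run_branch`, and the in-memory `log` list are hypothetical and only illustrate the rule "suppress START/SUCCEED/FAIL for virtual branches, but still re-raise"; the real ChildOperationExecutor is not shown here and may be structured differently.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
Checkpoint = Tuple[str, str]  # (action, operation_id)


def execute_child(
    operation_id: str,
    func: Callable[[], T],
    is_virtual: bool,
    log: List[Checkpoint],
) -> T:
    """Run func as a child scope, writing lifecycle entries only if not virtual."""
    if not is_virtual:
        log.append(("START", operation_id))
    try:
        result = func()
    except Exception:
        if not is_virtual:
            log.append(("FAIL", operation_id))
        raise  # the failure reaches the caller in both modes
    if not is_virtual:
        log.append(("SUCCEED", operation_id))
    return result


def failing_branch() -> None:
    raise ValueError("branch failed")


def run_branch(is_virtual: bool) -> Tuple[List[Checkpoint], bool]:
    """Run the failing branch; return the written entries and whether it raised."""
    log: List[Checkpoint] = []
    raised = False
    try:
        execute_child("branch-0", failing_branch, is_virtual, log)
    except ValueError:
        raised = True
    return log, raised


print(run_branch(is_virtual=False))  # ([('START', 'branch-0'), ('FAIL', 'branch-0')], True)
print(run_branch(is_virtual=True))   # ([], True)
```

In both modes the exception still surfaces to the caller, which is what lets a batch-level result record the failure even though the virtual branch leaves no entry of its own.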

Mechanism changes (applied alongside the fixes):

  • Single source of truth for the virtual-vs-real decision. create_child_context computes two fields (_parent_id, _step_id_prefix) at construction; no per-operation-method knowledge is required. New operations just read self._parent_id and work correctly under both modes.
  • Field names match their roles. _parent_id is "the id my inner operations stamp as their parent"; _step_id_prefix is "how I prefix step ids". Each field has one job.
  • is_virtual is encapsulated in the context as a cached property. Callers opt in with create_child_context(..., is_virtual=True); the property makes the state inspectable.
  • Nested virtual-in-virtual matches the JS reference. A virtual child of a virtual parent inherits its parent's reporting ancestor, so chained FLAT layers collapse to the outermost non-virtual context without dangling parent-id references.
  • ChildConfig.is_virtual drives lifecycle-checkpoint suppression in ChildOperationExecutor (START, SUCCEED, FAIL) and, via run_in_child_context, the child context's own virtual-ness. The field remains public, matching the JS SDK's ChildConfig.virtualContext.
  • Fewer parameters threaded through. operation_identifier is gone from ConcurrentExecutor, MapExecutor, and ParallelExecutor constructors and from_items/from_callables; the concurrency layer no longer needs an OperationIdentifier to figure out the reporting parent.
  • Tests exercise the invariants directly with a real DurableContext and assert wire-format decisions, including nested-virtual-in-virtual coverage.
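As a rough illustration of the construction-time rule in the list above, the sketch below computes both fields once when the child is created. `ToyContext` is a hypothetical stand-in, not the SDK's DurableContext: it stores is_virtual as a plain field rather than a cached property, and it assumes the step-id prefix always follows the child's own operation id, which the description does not state.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyContext:
    """Hypothetical stand-in for a context holding the two computed fields."""

    _parent_id: Optional[str]       # id inner operations stamp as their parent
    _step_id_prefix: Optional[str]  # how step ids are prefixed
    is_virtual: bool = False

    def create_child_context(
        self, operation_id: str, is_virtual: bool = False
    ) -> ToyContext:
        # Virtual child: inherit the parent's reporting ancestor.
        # Real child: inner operations report to the child's own id.
        parent_id = self._parent_id if is_virtual else operation_id
        # Assumption: the prefix uses the child's own id in both modes.
        return ToyContext(
            _parent_id=parent_id,
            _step_id_prefix=operation_id,
            is_virtual=is_virtual,
        )


root = ToyContext(_parent_id=None, _step_id_prefix=None)
real = root.create_child_context("1")
outer_virtual = real.create_child_context("1-1", is_virtual=True)
inner_virtual = outer_virtual.create_child_context("1-1-1", is_virtual=True)

# Chained virtual layers collapse to the outermost non-virtual context.
print(real._parent_id, outer_virtual._parent_id, inner_virtual._parent_id)  # 1 1 1
print(inner_virtual._step_id_prefix)  # 1-1-1
```

Because the decision is made once at construction, an operation inside any of these contexts only needs to read `_parent_id`; it never has to check whether its context is virtual.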

Impact:

  • Users can call ctx.run_in_child_context(config=ChildConfig(is_virtual=True)) without their execution being rejected at checkpoint time.
  • FLAT map/parallel no longer emits phantom FAILED CONTEXT entries on branch failure.
  • Python SDK virtual-context semantics now match the JS reference SDK exactly (same parent-id propagation rule; nested virtuals collapse identically).
  • Cost: eliminates billable FAILED CONTEXT entries that were previously emitted for failed virtual branches.

Also includes a minor CONTRIBUTING.md improvement (a hatch venv symlink tip for editors that mangle paths containing spaces).

Issue #, if available:

Fixes #362, Fixes #363

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Virtual child contexts (FLAT-mode map/parallel branches) no longer
write FAIL checkpoints when the user function raises. The branch is
a logical scope only; it does not appear in the execution history
regardless of outcome, aligning with the JS reference SDK.

Also fixes an incoherent state when a user set
ChildConfig.is_virtual=True via run_in_child_context: lifecycle
checkpoints were suppressed, but the child context's _parent_id was
the child's own operation id (never announced in the checkpoint
stream), so inner operations stamped a parent_id pointing to a
dangling reference. Nesting produced a chain of such references.
The two decisions (lifecycle suppression, parent-id propagation) are
now coupled through ChildConfig.is_virtual.

Improves observability (no phantom FAILED CONTEXT entries for
virtual branches), cost (no billable operation per failed virtual
branch), and cross-SDK wire parity.

fixes #362, fixes #363

Kiro and VS Code mangle the hatch interpreter path when it contains
spaces, breaking "Select Interpreter". Document the `.venv` symlink
workaround and split the existing VS Code section into Interpreter
and Linting subsections.

When both 'serdes' and 'item_serdes' are provided:
- item_serdes: Used for individual item results in child contexts
- serdes: Used for the entire BatchResult at handler level

Let's remove is_virtual from ChildConfig. It's a mistake to expose this field to users. It is supposed to be used only internally by concurrency operations.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

virtual child context orphaned parent_ids in execution history
Virtual child contexts emit FAIL checkpoints in FLAT mode

2 participants