fix: preserve completed task progress on checkpoint resume #231

Open: jafreck wants to merge 4 commits into main from fix/checkpoint-resume-reset

jafreck commented Apr 1, 2026

Problem

checkpoint.completeTask() was defined but never called anywhere during Phase 4 execution. This left completedTasks permanently empty in the checkpoint. On resume, filterPhase4CompletedExecutionIds() used this empty set to decide which Phase 4 flow entries to keep — and since no tasks were "completed", it removed all task substep entries, causing the entire task graph to restart from scratch.
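The failure mode can be reproduced in a few lines. This is a minimal sketch, not the code in src/core/checkpoint.ts: the function name `filterByCompletedTasks` and the `taskId/substep` execution ID format are assumptions made for illustration.

```typescript
// Hypothetical sketch: the real checkpoint types and ID format are not shown in this PR.
// Assumption: each Phase 4 execution ID looks like "<taskId>/<substep>".
function filterByCompletedTasks(
  executionIds: string[],
  completedTasks: Set<string>,
): string[] {
  // Keep only substep entries whose owning task is recorded as completed.
  return executionIds.filter((id) => completedTasks.has(id.split("/")[0]));
}

// completeTask() was never called, so the set is still empty on resume.
const neverPopulated = new Set<string>();
const flowEntries = ["task-1/migrate", "task-1/commit", "task-2/migrate"];

// Every entry is dropped, including the ones for the already-committed task-1.
const keptEntries = filterByCompletedTasks(flowEntries, neverPopulated);
console.log(keptEntries.length); // 0
```

With an empty set the predicate is false for every entry, which is why a resume discarded all recorded substeps rather than only those of unfinished tasks.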

This was observed during a live zstd C→Rust migration: after ~46 hours and 43 committed tasks, a kill-and-resume reset the migration to 3 committed tasks.

Fix

Populate completedTasks during Phase 4 (all execution modes):

  • runCommitSubstep() now calls checkpoint.completeTask(task.id) after the code-migrator commit step.
  • The per-task flow's complete step also calls completeTask() for completeness.
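The ordering in the first bullet can be sketched as follows. This is an illustration under stated assumptions: `CheckpointSketch`, `runCommitSubstepSketch`, and the injected `commit` callback are hypothetical stand-ins, and the real substep is presumably asynchronous; it is synchronous here to keep the example short.

```typescript
// Hypothetical stand-ins for the real Task and checkpoint types.
interface TaskSketch {
  id: string;
}

class CheckpointSketch {
  readonly completedTasks = new Set<string>();

  // Records a task as completed; adding the same ID twice is harmless.
  completeTask(taskId: string): void {
    this.completedTasks.add(taskId);
  }
}

// Assumption: `commit` represents the code-migrator commit step and throws on failure.
function runCommitSubstepSketch(
  task: TaskSketch,
  checkpoint: CheckpointSketch,
  commit: (task: TaskSketch) => void,
): void {
  commit(task);
  // Reached only if the commit did not throw, so a failed commit
  // never marks the task as completed.
  checkpoint.completeTask(task.id);
}
```

Because `completedTasks` is modeled as a set, the second call from the per-task flow's complete step would be a no-op for a task that was already recorded at commit time.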

Backward-compatible fallback in filterPhase4CompletedExecutionIds:

  • When completedTasks is empty, derives the completed set from the flow checkpoint's own /commit entries (tasks that have a committed substep are treated as completed).
  • Back-fills completedTasks from this derived set so downstream logic stays consistent.
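A rough sketch of the fallback follows. The function name `filterWithBackfill`, the `taskId/substep` ID format, and the choice to keep IDs without a slash are assumptions for illustration; only the use of `/commit` entries and the back-fill of `completedTasks` come from the description above.

```typescript
// Hypothetical sketch of the fallback; not the code in src/core/checkpoint.ts.
// Assumption: task substep IDs look like "<taskId>/<substep>" and a committed
// task has an entry ending in "/commit".
const COMMIT_SUFFIX = "/commit";

function filterWithBackfill(
  executionIds: string[],
  completedTasks: Set<string>,
): string[] {
  if (completedTasks.size === 0) {
    // Older checkpoints never populated completedTasks, so derive the set
    // from the flow checkpoint's own commit entries and back-fill it in place.
    for (const id of executionIds) {
      if (id.endsWith(COMMIT_SUFFIX)) {
        completedTasks.add(id.slice(0, -COMMIT_SUFFIX.length));
      }
    }
  }

  return executionIds.filter((id) => {
    const slash = id.lastIndexOf("/");
    // Assumption: IDs without a slash are not task substeps and are kept as-is.
    if (slash === -1) return true;
    return completedTasks.has(id.slice(0, slash));
  });
}
```

For example, given the entries `phase4-setup`, `task-1/migrate`, `task-1/commit`, and `task-2/migrate` with an empty set, the sketch keeps the first three, drops `task-2/migrate`, and leaves `completedTasks` containing `task-1`.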

Other changes

  • Fix task-graph-builder.test.ts test schema: add parent_symbol_id column to the symbols table, required by the updated @jafreck/lore package.
  • Assorted agent template and config changes from ongoing migration work.

Testing

  • Added test: should back-fill completedTasks from Phase 4 commit entries when completedTasks is empty
  • All 53 checkpoint tests pass
  • All 48 task-graph-builder tests pass (previously 42 were failing)
  • Full non-e2e test suite: 1467 tests, all pass

jafreck added 4 commits March 28, 2026 18:45
- Add prepareForResume() to clear terminalExhaustion, failed/blocked
  tasks, and stale Phase 4 flow checkpoint entries on reload
- Reset __flowCheckpoint and __phase4FlowCheckpoint status from
  'failed' to 'running' so the Cadre runner re-enters correctly
- Filter Phase 4 completedExecutionIds to only retain substeps for
  fully-completed tasks; failed/in-flight tasks re-enter from scratch
- Reset flow checkpoint error field in resetFromPhase()
- Accumulate outputTokens from assistant.message events in Copilot
  JSONL parser as fallback when usage summary is missing
- Improve Lore MCP tool documentation in agent prompt partial with
  explicit tool names and stronger guidance to prefer Lore over view
- Tune zstd fixture config: maxParallelAgents 12→8, resume false
checkpoint.completeTask() was defined but never called, leaving
completedTasks empty. On resume, filterPhase4CompletedExecutionIds
used the empty set to filter out ALL task substep entries from the
Phase 4 flow checkpoint, causing the entire task graph to restart.

Fix:
- runCommitSubstep now calls checkpoint.completeTask(task.id) so
  completedTasks is populated during Phase 4 (all execution modes).
- filterPhase4CompletedExecutionIds derives completed tasks from
  the flow checkpoint's own /commit entries when completedTasks is
  empty (backward compat for existing checkpoints).
- Add test for back-fill resume path.
- Fix task-graph-builder test schema (add parent_symbol_id column
  required by updated @jafreck/lore).
…-reset

# Conflicts:
#	src/core/checkpoint.ts
#	tests/core/checkpoint.test.ts
#	tests/fixtures/zstd-c-project/migration.config.json
- Add symbol_metrics row for synthetic symbol in kb-server test
- Update semantic search sort order expectation in kb-search-tool test