Skip to content

fix(ship,finalize-pr): enforce explicit loops, async waits, and result categorization#323

Open
JacobPEvans wants to merge 3 commits into
mainfrom
fix/ship-finalize-explicit-loops
Open

fix(ship,finalize-pr): enforce explicit loops, async waits, and result categorization#323
JacobPEvans wants to merge 3 commits into
mainfrom
fix/ship-finalize-explicit-loops

Conversation

@JacobPEvans
Copy link
Copy Markdown
Owner

Summary

Closes a class of premature-exit failures where /ship and /finalize-pr would return a PR as "ready to merge" when in reality CodeQL was still pending, a subagent silently failed, or the Phase 3 → Phase 2 loop terminated early because the loop logic was prose ("return to Phase 2") with no enforcement.

Why

  • The user reported /ship "ending pre-maturely recently (only sometimes) and not leaving FULLY mergeable PRs"
  • Source review of ship/SKILL.md and finalize-pr/SKILL.md identified five distinct root causes:
    1. Phase 3 → Phase 2 loop is prose with no enforced iteration counter — subagents treat it as advisory
    2. Pending async checks (CodeQL, required-reviewer) trigger Phase 3 abort, but Phase 2 has nothing to fix; agent bails
    3. After /resolve-codeql fix and /resolve-pr-threads, skill advances without re-querying state — silent subagent failures pass through
    4. Background CI Task agent can die without propagating; Phase 2.3 reads stale state
    5. All Phase 3 aborts treated identically; no taxonomy of fixable vs wait vs human-required failures

What's in this PR

finalize-pr/SKILL.md

  • Phase 2.0: explicit phase_2_iteration counter; cap at 5; state-dump exit on overflow
  • Phase 2.1: background CI Task agent failure fallback to direct gh pr checks --watch poll
  • Phase 2.2: post-fix verification mandatory after /resolve-codeql and /resolve-pr-threads (re-query state, assert delta, track noop counts)
  • Phase 2.6: wait-for-async-checks polling step (5-min cap) before Phase 3 — "pending checks are time-passage problems, not failures"
  • Phase 3.1.1: failure taxonomy table dispatching each (mergeStateStatus, reviewDecision, statusCheckRollup.state, reviewThreads) combination to one of four handlers
  • Stop Condition: replaced prose with explicit pseudocode + result category enum
  • Phase 5: result category surfaced at top of summary; precise format; non-ready categories must include specific failed gate and manual unblock action

ship/SKILL.md

  • Step 3.1: post-finalize 60-second wait if recent push (GitHub mergeStateStatus is async)
  • Step 3.2: explicit retry cap of 3 /finalize-pr re-invocations per PR per /ship invocation
  • Step 3.3: required result categorization (Ready to merge / Ready except human gate / Ship aborted) surfaced in Ship Summary

User-visible outcome

Every /ship invocation reports one of three categories with a specific reason:

  • Ready to merge — all gates clean, safe for human merge
  • Ready except human gate — only blocker is REVIEW_REQUIRED or isDraft=true
  • Ship aborted — specific gate stuck, with the manual action that would unblock it

Silent "Ready to merge" while actually blocked is no longer possible.

Test plan

  • CI green (markdown lint + agentskills.io validator)
  • Manual: run /ship on a PR with passing CI and clean CodeQL — expect Ready to merge
  • Manual: run /ship on a PR with required human reviewer who hasn't approved — expect Ready except human gate (no infinite loop)
  • Manual: run /ship on a PR with a CI failure the agent can't auto-fix — expect Ship aborted with the specific check named and a manual-action suggestion

Related work

  • Documentation for /ship lifecycle including these failure modes lands in JacobPEvans/docs #30
  • The cloud half of the issue→PR pipeline (ai-workflows callers on docs repo) is filed as a follow-up PR

Assisted-by: Claude noreply@anthropic.com

…t categorization

Closes a class of premature-exit failures where /ship and /finalize-pr would
return a PR as "ready to merge" when in reality CodeQL was still pending,
a subagent silently failed, or the Phase 3 -> Phase 2 loop terminated early
because the loop logic was prose ("return to Phase 2") with no enforcement.

finalize-pr/SKILL.md changes:

- Phase 2.0: explicit phase_2_iteration counter; hard cap at 5; on the 6th
  entry emit a state dump (gate result, alert count, threads, CI rollup,
  HEAD sha), tag the PR aborted_iteration_cap, jump to Phase 5
- Phase 2.1: if the background CI Task agent errors, fall back to direct
  `gh pr checks --watch` polling with a 10-minute timeout; log the agent
  failure visibly
- Phase 2.2: post-fix verification mandatory after /resolve-codeql fix and
  /resolve-pr-threads -- re-query alert count and unresolved-thread count,
  assert a strict decrease, and track per-fix noop counters that exit to
  Phase 5 with needs_human after 2 consecutive no-op subagent calls
- Phase 2.6: new wait-for-async-checks step that polls pr-checks for up to
  5 minutes until no PENDING bucket remains, and separately re-queries the
  code-scanning alerts (CodeQL is not in pr-checks). Pending checks are
  time-passage problems, not failures.
- Phase 3.1.1: failure taxonomy table that dispatches each (mergeStateStatus,
  reviewDecision, statusCheckRollup.state, reviewThreads) combination to one
  of four handlers: fixable_loop_to_phase_2, wait_and_recheck,
  needs_human_exit_phase_5, hard_block_exit_phase_5
- Stop Condition: replaced prose with explicit pseudocode showing the loop
  bounds, the four handler dispatch, and the five result categories (ready,
  ready_except_human_gate, needs_human, aborted_iteration_cap, hard_block)
- Phase 5: result category must be surfaced at top of summary; format is
  precise; non-ready categories must include the specific failed gate and
  the manual action that would unblock it

ship/SKILL.md changes:

- Step 3.1: post-finalize wait if the most recent commit on the branch
  landed in the last 60 seconds (GitHub's mergeStateStatus is async)
- Step 3.2: explicit retry cap of 3 /finalize-pr re-invocations per PR
  per /ship invocation; do not silently report Ready after the cap
- Step 3.3: required result categorization (Ready to merge / Ready except
  human gate / Ship aborted) surfaced in the Ship Summary; never silently
  elide a non-ready result

User-visible outcome: every /ship invocation reports one of three categories
with a specific reason. Silent "Ready to merge" while actually blocked is no
longer possible -- the iteration cap and result-category dispatch forces an
explicit terminal state with an actionable next step.

Assisted-by: Claude <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the finalize-pr and ship skills by introducing iteration limits, explicit failure handling taxonomies, and mandatory wait periods for asynchronous GitHub checks to improve reliability. Review feedback highlights several technical improvements: correcting the gh pr checks command which lacks an --interval flag, using GraphQL variables for safer API calls, adopting canonical CodeQL query patterns for better error handling, and replacing platform-specific date commands with portable jq alternatives.

Comment thread github-workflows/skills/finalize-pr/SKILL.md Outdated
Comment thread github-workflows/skills/finalize-pr/SKILL.md Outdated
Comment thread github-workflows/skills/finalize-pr/SKILL.md Outdated
Comment thread github-workflows/skills/ship/SKILL.md Outdated
Four fixes from PR #323 review:

1. finalize-pr/SKILL.md (Phase 2.1 CI poll fallback): drop the
   --interval 30 flag from gh pr checks --watch. The flag does not
   exist in gh CLI and would cause the command to fail at runtime.

2. finalize-pr/SKILL.md (Phase 2.2 review-thread post-fix verification):
   switch the inline GraphQL query to use $owner/$repo/$prNumber
   variables passed via -f/-F flags (the canonical pattern from
   gh-cli-patterns) instead of string interpolation. Safer for special
   characters, matches the rest of the repo.

3. finalize-pr/SKILL.md (Phase 2.6 wait-for-async-checks): switch from
   an inline jq filter to the canonical code-scanning alert query from
   gh-cli-patterns:
     gh api 'repos/<OWNER>/<REPO>/code-scanning/alerts?state=open&per_page=100' --jq 'length' || echo "0"
   Server-side state filter, graceful 404 handling, fewer rows over
   the wire.

4. ship/SKILL.md (Step 3.1 post-finalize wait): replace BSD-specific
   date -j -f with the portable jq fromdate filter. The skill runs on
   both macOS and CI Linux runners; the BSD flags would break on
   GNU/Linux. Also normalized last_commit_ts -> lastCommitTs for
   camelCase consistency.

Assisted-by: Claude <noreply@anthropic.com>
PR #323 originally added +240/-9 lines to encode five failure-mode fixes
as a 14-row failure taxonomy, a 5-value result category enum, a pseudocode
Stop Condition block duplicating prose above it, and per-section post-fix
verification blocks for CodeQL and threads. The mechanics underneath were
small: an iteration counter, a pending-check drain, a re-query after every
/resolve-*, and a CI-monitor fallback.

This refactor keeps every load-bearing mechanic but replaces the verbose
machinery with three load-bearing directives at the top of Phase 2:

- NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
  (verbatim from superpowers:verification-before-completion)
- PENDING IS NOT FAILURE
- If a /resolve-* call did not strictly decrease the corresponding count,
  the subagent did nothing

Changes vs PR #323 head:

- Delete Phase 3.1.1 failure taxonomy table (~28 lines)
- Delete Stop Condition pseudocode + result category enum (~50 lines)
- Delete Phase 5 result category line-format table (~15 lines)
- Fold per-section CodeQL/threads post-fix verify blocks (~24 lines) into
  one shared Step 2.2.5 referencing /gh-cli-patterns queries
- Slim Phase 2.0 iteration counter, Phase 2.1 fallback, Phase 2.6 drain
- Replace Step 3.3 result-category table in /ship with a one-line "surface
  whatever /finalize-pr reported, do not re-classify"
- Move the GraphQL unresolved-thread-count query to /gh-cli-patterns as a
  new Canonical Unresolved Thread Count subsection so other skills can
  reference it

All five failure modes remain fixed:
1. iteration counter capped at 5 (mode 1)
2. Phase 2.6 pending-check drain (mode 2)
3. iron-law sentence + Step 2.2.5 re-query (mode 3)
4. Phase 2.1 gh pr checks --watch fallback (mode 4)
5. Phase 5 Blocked on: <gate> line replaces the category enum (mode 5)

Net delta vs main: +231 → +122 (47% reduction).

Assisted-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant