Skip to content

ci: fix propose-fix /tmp prompt bug; extend Reviewer A/B to bot PRs#2534

Merged
bpamiri merged 1 commit into
developfrom
peter/bot-prompt-fix-and-bot-pr-reviews
May 10, 2026
Merged

ci: fix propose-fix /tmp prompt bug; extend Reviewer A/B to bot PRs#2534
bpamiri merged 1 commit into
developfrom
peter/bot-prompt-fix-and-bot-pr-reviews

Conversation

@bpamiri
Copy link
Copy Markdown
Collaborator

@bpamiri bpamiri commented May 10, 2026

Summary

Two related fixes from tonight's pipeline-test session:

  1. Fix the propose-fix /tmp/ prompt bug. Tonight's data finally identified what was actually causing the 15-19 "permission denials" per propose-fix run that we kept attributing to tool allowlist mismatch.
  2. Extend Reviewer A/B to cover the bot's own PRs (even draft). So humans don't merge bot work without analytical context.

What we learned tonight

show_full_output: true exposed the per-turn tool calls on the successful 100-turn run (#25618096945). The histogram:

Tool Calls
Bash 62
Read 17
Edit 6
Write 2
Task / WebFetch / WebSearch / Grep / Glob 0

Opus never reached for sub-agents or the web. The "denials" were Claude Code's working-directory sandbox blocking commands like:

cp in '/tmp/bot-failure.json' was blocked. For security, Claude Code 
may only copy files to/from the allowed working directories for this 
session: '/home/runner/work/wheels/wheels'.

The cause traced directly to propose-fix.md step 6:

"Capture the failure to /tmp/bot-failure.json (use format=json on the test endpoint or save the bash output)."

Every run, Opus dutifully tried to follow that instruction, got sandboxed, retried, and burned roughly 20 turns on an instruction the runtime cannot satisfy. Three identical 61-turn failures on issues #2526/#2530 (#25616307210, #25616508963, #25617692327) all hit this same wall. The 100-turn bump worked because it gave headroom to absorb the wasted retries.

What's in this PR

.claude/commands/propose-fix.md

Step 6 rewritten:

 6. **Run the failing spec.**

    bash tools/test-local.sh <layer>

-   Capture the failure to `/tmp/bot-failure.json` (use `format=json` on the
-   test endpoint or save the bash output). **Confirm the spec actually
-   fails.** A passing spec at this point means you wrote the wrong test —
-   redo it.
+   Read the bash output directly to confirm the failure — note which
+   assertion fired and the diff between expected and actual. **Do NOT
+   write to `/tmp/` or anywhere outside the repository working tree.**
+   The runtime sandboxes file operations to the working directory; `cp`
+   to `/tmp` (or any out-of-tree path) will be blocked and burn turns on
+   retries. If you need to persist test output for the next step, write
+   to a working-tree path like `./.bot-test-output.txt` and clean it up
+   before commit. **Confirm the spec actually fails.** A passing spec at
+   this point means you wrote the wrong test — redo it.

.github/workflows/bot-propose-fix.yml

Removes the diagnostic show_full_output: true line that was added in PR #2531. Job done — its data informed this PR. max-turns 100 stays.

.github/workflows/bot-review-a.yml

+    # Reviews human PRs that are ready-for-review, AND the bot's own PRs
+    # even while draft. Bot PRs get reviewed pre-merge so the human merge
+    # decision is informed by Reviewer A's analysis (and Reviewer B's
+    # critique) rather than blind. Human drafts are skipped because drafts
+    # are work-in-progress and reviews would churn.
     if: |
       vars.WHEELS_BOT_ENABLED == 'true'
-      && github.event.pull_request.draft == false
-      && github.event.pull_request.user.login != 'wheels-bot[bot]'
+      && (
+        github.event.pull_request.user.login == 'wheels-bot[bot]'
+        || github.event.pull_request.draft == false
+      )

.github/workflows/bot-review-b.yml

Same author-aware condition: critiques Reviewer A's reviews on human ready-for-review PRs and the bot's own PRs (even draft).

docs/contributing/wheels-bot.md

Stage 5 (Reviewer A) and Stage 6 (Reviewer B) descriptions updated to reflect the new bot-PR coverage and document the natural stopping condition.

Why this is the right design

For the prompt fix: structural — the /tmp/ instruction was always going to fail. Removing it unblocks every future propose-fix run, not just this one.

For Reviewer A/B coverage of bot PRs: without it, the human merge decision is blind. The bot is producing a draft PR with implementation + spec + docs commit; the human needs Reviewer A's read on correctness/conventions/cross-engine and Reviewer B's quality check on Reviewer A before deciding to merge. The PR remains --draft throughout (the load-bearing safeguard against any auto-merge), so marking ready stays the explicit human signal.

Stopping condition: existing mechanics handle it. Reviewer B's 3-round cap is the upper bound. The typical bot-PR flow is: PR opened (SHA-A) → Reviewer A reviews SHA-A → bot-update-docs adds docs commit (SHA-B) → Reviewer A reviews SHA-B → Reviewer B critiques each → no further commits → no further reviews. 1-2 rounds in practice.

Test plan

  • Merge to develop.
  • Test the prompt fix — re-dispatch propose-fix on issue #2530 (or a new issue): gh workflow run bot-propose-fix.yml --repo wheels-dev/wheels -F issue-number=2530. Expect: turns well under 100, near-zero permission_denials, no /tmp/ errors in the log.
  • Test the Reviewer A coverage — observe Reviewer A run on PR #2533 once it picks up changes from this PR's merge (or close+reopen fix(view): wire registry-packages list into debug bar Packages tab #2533 to retrigger).
  • Verify bot-tdd-gate.yml still passes on bot PRs (TDD invariant unchanged).
  • Confirm the bot PR remains --draft after the cascade (so the required_approving_review_count: 1 gate continues to require a human ready+approve).

Tonight's session totals (for context)

This PR closes out tonight's investigation. Cumulative spend across all bot runs: roughly $15. The investment surfaced one architectural fix (PR #2528 — propose-fix scope reduction), one Phase 4 lift (PR #2529), one diagnostic instrumentation (PR #2531), and now one prompt-bug fix + reviewer-coverage extension (this PR). The pipeline shipped one real-world bot PR (#2533 — the registry-list fix), which is the proof of life.

🤖 Generated with Claude Code

Two distinct problems landed in one PR because they're best fixed
together: the propose-fix prompt was instructing Opus to write to
/tmp/, which the Claude Code sandbox always blocks; and the reviewer
chain was skipping bot-authored PRs, leaving humans to merge bot
work without analytical context.

## /tmp prompt bug (the real root cause)

Earlier diagnosis blamed the propose-fix max-turns failures on tool
allowlist mismatch. Run #25618096945 (with show_full_output:true)
disproved that — the histogram showed 62 Bash + 17 Read + 6 Edit + 2
Write + ZERO Task/WebFetch/WebSearch. Opus never reached for sub-agents
or the web. The 15-19 "permission denials" per failed run were actually
Claude Code's built-in working-directory sandbox blocking commands like:

  cp '/tmp/bot-failure.json' was blocked. For security, Claude Code
  may only copy files to/from the allowed working directories ...

The instruction came from propose-fix.md step 6:
  > Capture the failure to /tmp/bot-failure.json

Opus dutifully tried to follow it on every run, got blocked, retried
variations, and burned ~20 turns per run on an instruction the runtime
cannot satisfy. Step 6 now tells Opus to read bash output directly and
forbids writes outside the working tree.

## Reviewer A/B coverage of bot PRs

Previously bot-review-a.yml skipped wheels-bot[bot] PRs and
bot-review-b.yml skipped drafts. Net effect: the bot's draft PRs got
zero analytical review, so when a human came to merge they had only
the diff to evaluate. Without Reviewer A's findings or Reviewer B's
critique, the human was deciding blind.

New behavior is author-aware:
  - Bot PR (draft or ready) → Reviewer A runs, Reviewer B follows
  - Human PR (ready) → Reviewer A runs (unchanged)
  - Human PR (draft) → skip (drafts are work-in-progress, unchanged)

The natural stopping condition emerges from existing mechanics: bot's
PR doesn't update its own SHA after bot-update-docs commits, so once
the docs stage settles there are no new SHAs → no new Reviewer A runs
→ no new Reviewer B runs. Reviewer B's 3-round cap is the upper bound.
PR stays --draft throughout; the human marks it ready as the explicit
"I've read the reviews and approve" signal, which is when
required_approving_review_count: 1 begins to apply.

## Diagnostic show_full_output removed

Job done. Removing the flag keeps Actions logs at default size and
avoids the (mild) information disclosure of every per-turn tool call
on a public repo.

max-turns stays at 100. With the prompt bug fixed the original 60
should fit, but headroom is cheap and matches the budget Bun's public
workflows use for similar work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the docs label May 10, 2026
Copy link
Copy Markdown
Contributor

@wheels-bot wheels-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wheels Bot — Reviewer A

TL;DR: This PR contains two separate changes: (1) a clear, correct fix removing the sandboxed /tmp/ instruction from propose-fix.md, and (2) an extension of Reviewer A/B to cover bot-authored draft PRs. The prompt fix is straightforwardly correct. The Reviewer A/B extension has a correctness bug: when Reviewer A runs on a bot-authored PR, the same wheels-bot[bot] App token that created the PR is used to submit the review — GitHub blocks CHANGES_REQUESTED and APPROVED review submissions from PR authors, so any verdict other than COMMENT will fail with a 422. Requesting changes.


Correctness

bot-review-a.yml — Reviewer A self-review 422 failure

.github/workflows/bot-review-a.yml, lines 56–70:

- name: Run Reviewer A
  if: steps.gate.outputs.skip == 'false'
  uses: anthropics/claude-code-action@v1
  with:
    allowed_bots: 'wheels-bot[bot],github-actions[bot]'
    github_token: ${{ steps.app-token.outputs.token }}
    prompt: |
      /review-pr ${{ github.event.pull_request.number }}

When bot-propose-fix.yml opens a PR, the PR author is wheels-bot[bot] (the App installation). Reviewer A now fires on those PRs and submits its review using the same steps.app-token.outputs.token — i.e., also wheels-bot[bot]. GitHub's REST API rejects CHANGES_REQUESTED and APPROVED review submissions from the PR author (API docs): "Submitting the review must be done by a different user than the pull request author." Only COMMENT type reviews are permitted from the author.

The /review-pr command selects --request-changes whenever it finds a Correctness, Cross-engine, or Security finding — exactly what you'd expect when reviewing bot-generated code. That path will produce a gh pr review --request-changes call, which GitHub will reject with 422. The workflow step will then fail, leaving no marker in the PR, so the skip-check won't prevent a repeat attempt on the next synchronize event.

Fix options (pick one):

Option A — Use the GITHUB_TOKEN (github-actions[bot]) for Reviewer A on bot PRs. github-actions[bot] is not the PR author and can submit any verdict.

- name: Determine review token
  id: review-token
  run: |
    if [ "${{ github.event.pull_request.user.login }}" = "wheels-bot[bot]" ]; then
      echo "token=${{ secrets.GITHUB_TOKEN }}" >> $GITHUB_OUTPUT
    else
      echo "token=${{ steps.app-token.outputs.token }}" >> $GITHUB_OUTPUT
    fi
- name: Run Reviewer A
  uses: anthropics/claude-code-action@v1
  with:
    github_token: ${{ steps.review-token.outputs.token }}
    ...

Option B — Force COMMENT verdict for bot PRs inside /review-pr. Add a guard at the top of .claude/commands/review-pr.md step 4: if the PR author is wheels-bot[bot], emit the review as --comment regardless of verdict, noting in the body that --request-changes was downgraded because the reviewer and author are the same App identity.

Option C — Capture the verdict in the idempotency marker before attempting submission. Even if the 422 fires, the review body (COMMENT-type fallback) still lands. This requires changing how /review-pr invokes gh pr review. More invasive.

Option A is the smallest change and correctly separates the concerns. The GITHUB_TOKEN (github-actions[bot]) already has pull-requests: write in this workflow if the permission is added.


Conventions

propose-fix.md step 6 — .bot-test-output.txt escape hatch has no .gitignore guard

.claude/commands/propose-fix.md, new text in step 6:

"If you need to persist test output for the next step, write to a working-tree path like ./.bot-test-output.txt and clean it up before commit."

The instruction to clean up before commit is correct, but there is no enforcement and no .gitignore entry for .bot-test-output.txt. If the bot forgets (e.g., if a later step errors before cleanup), the file ends up in the commit and then the PR diff. The self-check at step 12 doesn't explicitly list "no stray .bot-test-output.txt" as a box to tick. Consider adding .bot-test-output.txt to .gitignore, which gives a hard backstop regardless of prompt discipline.


Tests

No test changes are required for this PR (it touches only workflow YAML, a prompt file, and docs). The test plan in the PR body is honest: all items are manual post-merge verification steps that depend on running the live pipeline. No automated test coverage is feasible here, which is fine.


Docs

docs/contributing/wheels-bot.md update (lines 126–161) is accurate and matches the new workflow behavior. Stage 5 and Stage 6 descriptions correctly describe both the human-PR and bot-PR cases, and the natural stopping condition explanation (one Reviewer A per SHA) is clear.


Commits

Single commit: ci: fix propose-fix /tmp prompt bug; extend Reviewer A/B to bot PRs

  • Type ci — valid.
  • No scope — valid (change spans multiple workflow files).
  • Subject under 100 chars — passes.
  • Not ALL-CAPS — passes.

Commit message is commitlint-clean.


Security

No security concerns. The workflow changes tighten (not loosen) the bot's footprint. The allowed_bots allowlist remains specific (wheels-bot[bot],github-actions[bot]) rather than wildcard.

@bpamiri bpamiri merged commit 109f3f1 into develop May 10, 2026
7 checks passed
@bpamiri bpamiri deleted the peter/bot-prompt-fix-and-bot-pr-reviews branch May 10, 2026 03:29
@wheels-bot
Copy link
Copy Markdown
Contributor

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 1)

A's review is substantively correct on the main finding and well-structured. One real bug identified, one gap in the proposed fix, and one downstream consequence that deserves explicit mention.

Sycophancy

None detected. A found a real blocking bug and submitted CHANGES_REQUESTED with supporting evidence. No "LGTM" or approval without reasoning.

False positives

None detected. The 422 self-review claim is accurate GitHub API behavior — the API does reject APPROVED and CHANGES_REQUESTED submissions from the PR author, with only COMMENT type permitted. The concern is correctly scoped to the future bot-PR case, not the current PR (which is authored by a human). The .gitignore concern is minor but valid.

Missed issues

Option A requires a permissions change A did not flag. The current bot-review-a.yml has:

permissions:
  contents: read

Option A's GITHUB_TOKEN path needs pull-requests: write to submit any review type. Without it, Option A fails differently — a 403 instead of a 422 — and the symptom is identical (no marker written, no Reviewer B, retry loop). Whoever implements Option A needs to add:

permissions:
  contents: read
  pull-requests: write

A should have read the permissions block before recommending Option A as "the smallest change."

The cascading Reviewer B failure is understated. A noted that a 422 leaves no marker → retry loop on the next synchronize event. But the downstream consequence is starker: Reviewer B only fires when Reviewer A successfully submits a review (pull_request_review: submitted). If Reviewer A consistently fails with 422 on bot PRs, Reviewer B never fires on bot PRs either. The feature's stated goal — "the human merge decision has analysis to lean on" — is entirely defeated. A mentioned the retry loop but didn't connect it to this outcome, so a human reading the review might not appreciate how completely the feature fails rather than partially.

Verdict alignment

CHANGES_REQUESTED is correct. The 422 bug is blocking — the new bot-PR coverage silently fails without fixing it. Consistent with the finding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant