ci: fix propose-fix /tmp prompt bug; extend Reviewer A/B to bot PRs#2534
Conversation
Two distinct problems landed in one PR because they're best fixed together: the propose-fix prompt was instructing Opus to write to /tmp/, which the Claude Code sandbox always blocks; and the reviewer chain was skipping bot-authored PRs, leaving humans to merge bot work without analytical context. ## /tmp prompt bug (the real root cause) Earlier diagnosis blamed the propose-fix max-turns failures on tool allowlist mismatch. Run #25618096945 (with show_full_output:true) disproved that — the histogram showed 62 Bash + 17 Read + 6 Edit + 2 Write + ZERO Task/WebFetch/WebSearch. Opus never reached for sub-agents or the web. The 15-19 "permission denials" per failed run were actually Claude Code's built-in working-directory sandbox blocking commands like: cp '/tmp/bot-failure.json' was blocked. For security, Claude Code may only copy files to/from the allowed working directories ... The instruction came from propose-fix.md step 6: > Capture the failure to /tmp/bot-failure.json Opus dutifully tried to follow it on every run, got blocked, retried variations, and burned ~20 turns per run on an instruction the runtime cannot satisfy. Step 6 now tells Opus to read bash output directly and forbids writes outside the working tree. ## Reviewer A/B coverage of bot PRs Previously bot-review-a.yml skipped wheels-bot[bot] PRs and bot-review-b.yml skipped drafts. Net effect: the bot's draft PRs got zero analytical review, so when a human came to merge they had only the diff to evaluate. Without Reviewer A's findings or Reviewer B's critique, the human was deciding blind. New behavior is author-aware: - Bot PR (draft or ready) → Reviewer A runs, Reviewer B follows - Human PR (ready) → Reviewer A runs (unchanged) - Human PR (draft) → skip (drafts are work-in-progress, unchanged) The natural stopping condition emerges from existing mechanics: bot's PR doesn't update its own SHA after bot-update-docs commits, so once the docs stage settles there are no new SHAs → no new Reviewer A runs → no new Reviewer B runs. Reviewer B's 3-round cap is the upper bound. PR stays --draft throughout; the human marks it ready as the explicit "I've read the reviews and approve" signal, which is when required_approving_review_count: 1 begins to apply. ## Diagnostic show_full_output removed Job done. Removing the flag keeps Actions logs at default size and avoids the (mild) information disclosure of every per-turn tool call on a public repo. max-turns stays at 100. With the prompt bug fixed the original 60 should fit, but headroom is cheap and matches the budget Bun's public workflows use for similar work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Wheels Bot — Reviewer A
TL;DR: This PR contains two separate changes: (1) a clear, correct fix removing the sandboxed /tmp/ instruction from propose-fix.md, and (2) an extension of Reviewer A/B to cover bot-authored draft PRs. The prompt fix is straightforwardly correct. The Reviewer A/B extension has a correctness bug: when Reviewer A runs on a bot-authored PR, the same wheels-bot[bot] App token that created the PR is used to submit the review — GitHub blocks CHANGES_REQUESTED and APPROVED review submissions from PR authors, so any verdict other than COMMENT will fail with a 422. Requesting changes.
Correctness
bot-review-a.yml — Reviewer A self-review 422 failure
.github/workflows/bot-review-a.yml, lines 56–70:
- name: Run Reviewer A
if: steps.gate.outputs.skip == 'false'
uses: anthropics/claude-code-action@v1
with:
allowed_bots: 'wheels-bot[bot],github-actions[bot]'
github_token: ${{ steps.app-token.outputs.token }}
prompt: |
/review-pr ${{ github.event.pull_request.number }}When bot-propose-fix.yml opens a PR, the PR author is wheels-bot[bot] (the App installation). Reviewer A now fires on those PRs and submits its review using the same steps.app-token.outputs.token — i.e., also wheels-bot[bot]. GitHub's REST API rejects CHANGES_REQUESTED and APPROVED review submissions from the PR author (API docs): "Submitting the review must be done by a different user than the pull request author." Only COMMENT type reviews are permitted from the author.
The /review-pr command selects --request-changes whenever it finds a Correctness, Cross-engine, or Security finding — exactly what you'd expect when reviewing bot-generated code. That path will produce a gh pr review --request-changes call, which GitHub will reject with 422. The workflow step will then fail, leaving no marker in the PR, so the skip-check won't prevent a repeat attempt on the next synchronize event.
Fix options (pick one):
Option A — Use the GITHUB_TOKEN (github-actions[bot]) for Reviewer A on bot PRs. github-actions[bot] is not the PR author and can submit any verdict.
- name: Determine review token
id: review-token
run: |
if [ "${{ github.event.pull_request.user.login }}" = "wheels-bot[bot]" ]; then
echo "token=${{ secrets.GITHUB_TOKEN }}" >> $GITHUB_OUTPUT
else
echo "token=${{ steps.app-token.outputs.token }}" >> $GITHUB_OUTPUT
fi
- name: Run Reviewer A
uses: anthropics/claude-code-action@v1
with:
github_token: ${{ steps.review-token.outputs.token }}
...Option B — Force COMMENT verdict for bot PRs inside /review-pr. Add a guard at the top of .claude/commands/review-pr.md step 4: if the PR author is wheels-bot[bot], emit the review as --comment regardless of verdict, noting in the body that --request-changes was downgraded because the reviewer and author are the same App identity.
Option C — Capture the verdict in the idempotency marker before attempting submission. Even if the 422 fires, the review body (COMMENT-type fallback) still lands. This requires changing how /review-pr invokes gh pr review. More invasive.
Option A is the smallest change and correctly separates the concerns. The GITHUB_TOKEN (github-actions[bot]) already has pull-requests: write in this workflow if the permission is added.
Conventions
propose-fix.md step 6 — .bot-test-output.txt escape hatch has no .gitignore guard
.claude/commands/propose-fix.md, new text in step 6:
"If you need to persist test output for the next step, write to a working-tree path like
./.bot-test-output.txtand clean it up before commit."
The instruction to clean up before commit is correct, but there is no enforcement and no .gitignore entry for .bot-test-output.txt. If the bot forgets (e.g., if a later step errors before cleanup), the file ends up in the commit and then the PR diff. The self-check at step 12 doesn't explicitly list "no stray .bot-test-output.txt" as a box to tick. Consider adding .bot-test-output.txt to .gitignore, which gives a hard backstop regardless of prompt discipline.
Tests
No test changes are required for this PR (it touches only workflow YAML, a prompt file, and docs). The test plan in the PR body is honest: all items are manual post-merge verification steps that depend on running the live pipeline. No automated test coverage is feasible here, which is fine.
Docs
docs/contributing/wheels-bot.md update (lines 126–161) is accurate and matches the new workflow behavior. Stage 5 and Stage 6 descriptions correctly describe both the human-PR and bot-PR cases, and the natural stopping condition explanation (one Reviewer A per SHA) is clear.
Commits
Single commit: ci: fix propose-fix /tmp prompt bug; extend Reviewer A/B to bot PRs
- Type
ci— valid. - No scope — valid (change spans multiple workflow files).
- Subject under 100 chars — passes.
- Not ALL-CAPS — passes.
Commit message is commitlint-clean.
Security
No security concerns. The workflow changes tighten (not loosen) the bot's footprint. The allowed_bots allowlist remains specific (wheels-bot[bot],github-actions[bot]) rather than wildcard.
Wheels Bot — Reviewer B (round 1)A's review is substantively correct on the main finding and well-structured. One real bug identified, one gap in the proposed fix, and one downstream consequence that deserves explicit mention. SycophancyNone detected. A found a real blocking bug and submitted CHANGES_REQUESTED with supporting evidence. No "LGTM" or approval without reasoning. False positivesNone detected. The 422 self-review claim is accurate GitHub API behavior — the API does reject APPROVED and CHANGES_REQUESTED submissions from the PR author, with only COMMENT type permitted. The concern is correctly scoped to the future bot-PR case, not the current PR (which is authored by a human). The Missed issuesOption A requires a permissions change A did not flag. The current permissions:
contents: readOption A's GITHUB_TOKEN path needs permissions:
contents: read
pull-requests: writeA should have read the permissions block before recommending Option A as "the smallest change." The cascading Reviewer B failure is understated. A noted that a 422 leaves no marker → retry loop on the next synchronize event. But the downstream consequence is starker: Reviewer B only fires when Reviewer A successfully submits a review ( Verdict alignmentCHANGES_REQUESTED is correct. The 422 bug is blocking — the new bot-PR coverage silently fails without fixing it. Consistent with the finding. |
Summary
Two related fixes from tonight's pipeline-test session:
/tmp/prompt bug. Tonight's data finally identified what was actually causing the 15-19 "permission denials" per propose-fix run that we kept attributing to tool allowlist mismatch.What we learned tonight
show_full_output: trueexposed the per-turn tool calls on the successful 100-turn run (#25618096945). The histogram:Opus never reached for sub-agents or the web. The "denials" were Claude Code's working-directory sandbox blocking commands like:
The cause traced directly to
propose-fix.mdstep 6:Every run, Opus dutifully tried to follow that instruction, got sandboxed, retried, and burned roughly 20 turns on an instruction the runtime cannot satisfy. Three identical 61-turn failures on issues #2526/#2530 (#25616307210, #25616508963, #25617692327) all hit this same wall. The 100-turn bump worked because it gave headroom to absorb the wasted retries.
What's in this PR
.claude/commands/propose-fix.mdStep 6 rewritten:
6. **Run the failing spec.** bash tools/test-local.sh <layer> - Capture the failure to `/tmp/bot-failure.json` (use `format=json` on the - test endpoint or save the bash output). **Confirm the spec actually - fails.** A passing spec at this point means you wrote the wrong test — - redo it. + Read the bash output directly to confirm the failure — note which + assertion fired and the diff between expected and actual. **Do NOT + write to `/tmp/` or anywhere outside the repository working tree.** + The runtime sandboxes file operations to the working directory; `cp` + to `/tmp` (or any out-of-tree path) will be blocked and burn turns on + retries. If you need to persist test output for the next step, write + to a working-tree path like `./.bot-test-output.txt` and clean it up + before commit. **Confirm the spec actually fails.** A passing spec at + this point means you wrote the wrong test — redo it..github/workflows/bot-propose-fix.ymlRemoves the diagnostic
show_full_output: trueline that was added in PR #2531. Job done — its data informed this PR.max-turns 100stays..github/workflows/bot-review-a.yml.github/workflows/bot-review-b.ymlSame author-aware condition: critiques Reviewer A's reviews on human ready-for-review PRs and the bot's own PRs (even draft).
docs/contributing/wheels-bot.mdStage 5 (Reviewer A) and Stage 6 (Reviewer B) descriptions updated to reflect the new bot-PR coverage and document the natural stopping condition.
Why this is the right design
For the prompt fix: structural — the
/tmp/instruction was always going to fail. Removing it unblocks every future propose-fix run, not just this one.For Reviewer A/B coverage of bot PRs: without it, the human merge decision is blind. The bot is producing a draft PR with implementation + spec + docs commit; the human needs Reviewer A's read on correctness/conventions/cross-engine and Reviewer B's quality check on Reviewer A before deciding to merge. The PR remains
--draftthroughout (the load-bearing safeguard against any auto-merge), so marking ready stays the explicit human signal.Stopping condition: existing mechanics handle it. Reviewer B's 3-round cap is the upper bound. The typical bot-PR flow is: PR opened (SHA-A) → Reviewer A reviews SHA-A → bot-update-docs adds docs commit (SHA-B) → Reviewer A reviews SHA-B → Reviewer B critiques each → no further commits → no further reviews. 1-2 rounds in practice.
Test plan
gh workflow run bot-propose-fix.yml --repo wheels-dev/wheels -F issue-number=2530. Expect: turns well under 100, near-zero permission_denials, no/tmp/errors in the log.bot-tdd-gate.ymlstill passes on bot PRs (TDD invariant unchanged).--draftafter the cascade (so therequired_approving_review_count: 1gate continues to require a human ready+approve).Tonight's session totals (for context)
This PR closes out tonight's investigation. Cumulative spend across all bot runs: roughly $15. The investment surfaced one architectural fix (PR #2528 — propose-fix scope reduction), one Phase 4 lift (PR #2529), one diagnostic instrumentation (PR #2531), and now one prompt-bug fix + reviewer-coverage extension (this PR). The pipeline shipped one real-world bot PR (#2533 — the registry-list fix), which is the proof of life.
🤖 Generated with Claude Code