Conversation
Reviewer B (the meta-reviewer that critiques Reviewer A for sycophancy, false positives, and missed issues) hit its 20-turn budget mid-analysis on PR #2523 — the Astro 5→6 framework upgrade. The PR's diff was substantial (~770 lockfile lines, multiple package.json updates, an astro.config.mjs migration) and the bot ran out of turns before completing its critique. Reviewer A finished cleanly in the same run, so the primary review isn't compromised. But Reviewer B's max-turns failure shows up as a red check in the actions tab and prevents the meta-review from landing on PRs at the larger end of typical scope.
Bumping to 30 gives ~50% more headroom. Reviewer A is at 25 turns; keeping Reviewer B above that reflects that meta-review of a complex PR may need slightly more turns than the original review (load: read review, read PR, cross-reference, draft critique, iterate).
If 30 also proves tight on truly massive PRs, the next lever is either splitting Reviewer B into a faster-pass mode or reducing the prompt's scope on large diffs. For now, this should cover normal-PR fluctuation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wheels Bot — Reviewer A
TL;DR: Single-line CI config bump increasing Reviewer B's --max-turns from 20 → 30. The change is correct, well-motivated, and the PR body documents the reasoning carefully. No CFML, no security surface, no conventions at stake. Verdict: comment (minor nit only — no blocking issues).
Correctness
The change does exactly what the title claims. One line, no ambiguity.
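For orientation, here is a minimal sketch of the shape of the change in .github/workflows/bot-review-b.yml. The job and step names, the action reference, and the surrounding inputs are assumptions for illustration; only the --max-turns value is what this PR actually touches.

```yaml
# Sketch of .github/workflows/bot-review-b.yml. Job/step names, action reference,
# and surrounding inputs are assumed for illustration; only --max-turns changes here.
name: Bot Review B
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  reviewer-b:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Run Reviewer B meta-review
        uses: anthropics/claude-code-action@v1      # assumed action/version
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          claude_args: "--max-turns 30"             # was "--max-turns 20" before this PR
```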
The relative ordering Reviewer B (30) > Reviewer A (25) is internally consistent: Reviewer B's task (read Reviewer A's output, read the PR, cross-reference both) is strictly a superset of Reviewer A's task. Giving B 20% more headroom than A is proportionate.
The evidence from PR #2523 (a large framework upgrade diff) is concrete — the failure message quoted in the PR body confirms this is a real budget hit, not a speculative increase.
Minor nit — timeout-minutes could be tight at 30 turns
# .github/workflows/bot-review-b.yml, line 11
timeout-minutes: 15
At 30 turns, if tool calls on a large diff average ~30 s each, the job touches the 15-minute wall. For the Astro upgrade PR that triggered this bump, the actual turn count may still land near 30. This isn't a blocker — Claude Code turns in practice average well under 30 s for this kind of task — but it's worth tracking. If a future large-PR run hits the wall instead of the turns budget, bump this to 20 minutes at the same time.
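If the wall-clock budget does become the binding constraint before the turn budget, the companion change would be a one-line bump of the job-level key (standard GitHub Actions syntax; the job name here is assumed):

```yaml
# Companion change if the 15-minute wall is ever hit first (job name assumed).
jobs:
  reviewer-b:
    runs-on: ubuntu-latest
    timeout-minutes: 20   # was 15; 30 turns x ~30 s/turn worst case = 15 min exactly
```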
Commits
ci: bump Reviewer B max-turns 20 → 30 — valid type (ci), no scope (correct per CLAUDE.md: .github/workflows/ → type ci, no scope), subject is clear and well under 100 chars.
Docs / Tests
No documentation changes required. The test plan in the PR body is appropriate for a CI config bump: the integration test is "Reviewer B passes on this PR." No unit tests apply here.
Wheels Bot — Reviewer B (round 1)
A's review is accurate and proportionate for a single-line CI config change. The reasoning about turn budget, relative ordering (B at 30 > A at 25), and commit validity is correct. One minor citation error is worth flagging.
Sycophancy
None detected. A's positive assessment is evidence-backed (quoted failure from PR #2523, internal consistency argument). Verdict is "comment" rather than "approve" — appropriate for a CI workflow change.
False positives
Missed issues
None detected. The diff is a single line. No CFML, no security surface, no migrations, no docs impact.
Verdict alignment
"Comment" with a minor nit and no blocking issues is the correct verdict for a one-line CI config bump.
…(diagnostic) (#2531)

* ci: bump propose-fix max-turns 60 → 100 (diagnostic)

Three propose-fix attempts on the registry-list bug have hit the 60-turn ceiling identically:

| Run          | Turns | Cost  | Denials |
|--------------|-------|-------|---------|
| #25616307210 | 61    | $3.66 | 15      |
| #25616508963 | 61    | $3.36 | 19      |
| #25617692327 | 61    | $3.63 | 18      |

The PR #2528 step-9 scope cut (4 doc locations → 1 CHANGELOG entry) did not move propose-fix off the cliff — the doc updates were not the dominant scope sink. The persistent 15-19 permission denials per run suggest Opus is wasting ~30% of the budget on tool calls outside the allowlist.

This is a diagnostic bump: lift the ceiling so a real-world propose-fix can complete, then observe how many turns it actually takes. The distance between the new ceiling and the actual run length tells us whether the structural problem is budget (gap is small, ceiling was just too tight) or prompt-vs-allowlist mismatch (gap is large, denials are the real waste). Comparable to PR #2524's Reviewer B 20 → 30 bump pattern.

If the follow-up evidence shows ~30% of turns going to denials, a future PR will tighten the prompt or expand the allowlist accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(diagnostic): show_full_output on propose-fix to expose denied tools

The 60→100 turn bump unblocks the cliff but doesn't tell us why ~30% of every run goes to permission denials. The action's documentation explicitly points to show_full_output: true as the way to expose per-turn tool calls, including which were rejected. Adding the flag temporarily so the next propose-fix run logs every tool Opus tried plus its outcome.

Once we know which tools are denied (likely candidates: Task, WebFetch, broader Bash commands), a follow-up PR removes this flag and either expands the allowlist or tightens the prompt to stop reaching for them.

Security note: show_full_output exposes tool-call detail to the public Actions log. The action redacts secrets (anthropic_api_key, github_token) at a separate layer regardless of this flag, so credentials don't leak. Tool names and file paths Opus reads will be visible — that's the diagnostic content we want.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
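For reference, a sketch of what the diagnostic flag addition might look like in the propose-fix workflow. show_full_output, anthropic_api_key, and github_token are the input names taken from the commit message above; the job and step names, the action reference, and the claude_args wiring are assumptions.

```yaml
# Sketch of .github/workflows/bot-propose-fix.yml with the diagnostic flag added.
# Job/step names and the action reference are assumed; show_full_output and the
# secrets inputs are the ones named in the commit message above.
jobs:
  propose-fix:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run propose-fix (Opus)
        uses: anthropics/claude-code-action@v1      # assumed action/version
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          show_full_output: true                    # temporary: log every tool call,
                                                    # including permission denials
          claude_args: "--max-turns 100"            # diagnostic ceiling, was 60
```

The flag is additive, so the follow-up PR that removes it once the denial data is in hand would be a clean one-line revert.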
The asymmetry between failure modes argues for big headroom, not
tightly tuned caps:
- Hitting the cap = SDK terminates with error_max_turns, WIP commits
may land but no PR opens, no comment posted, marker state needs
manual unwinding before re-dispatch. Recovery cost includes the full
failed-run spend + cleanup time + 2nd-run spend.
- Going long = bot completes its work, PR opens, work is preserved.
Cost is just the extra turns of model time.
Both failure modes have similar dollar cost per turn. Only one preserves
work. The kill switch (WHEELS_BOT_ENABLED=false) is the actual safety
net for genuine runaway processes — the per-stage turn cap should be
sized for "the longest legitimate run", not "average plus margin".
Empirical evidence from tonight:
bot-triage: observed 10-15 turns successful (cap was 20, 33% margin)
bot-write-docs: observed 31-32 turns (cap was 30 — 1-2 turns over the
cliff; one run "succeeded" by SDK overshoot, one
failed at error_max_turns one turn earlier)
bot-propose-fix: observed ~75 turns successful after the /tmp prompt
fix (cap was 100, 33% margin)
bot-research: untested in production yet
bot-review-a: observed 10-20 turns (cap was 25, ~25% margin)
bot-review-b: bumped from 20 to 30 in PR #2524 already
Apply ~2x the observed-typical bump across the board:
bot-triage: 20 → 40 (Sonnet)
bot-research: 40 → 80 (Opus)
bot-propose-fix: 100 → 150 (Opus)
bot-write-docs: 30 → 100 (Sonnet)
bot-review-a: 25 → 60 (Sonnet)
bot-review-b: 30 → 60 (Sonnet)
Worst-case incremental cost if every stage on a single issue hits the
new cap: ~$13 over the previous worst-case (~$15 → ~$28). Realistic
case: most stages use the same number of turns they've been using;
the bumps are pure headroom against stochastic variation.
Subsumes the original 30 → 50 single-stage write-docs bump that was
the initial scope of this PR. The audit-wide bumps replace it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
One-line bump of Reviewer B's --max-turns from 20 → 30.
Why
On PR #2523 (the Astro 5→6 framework upgrade), Reviewer B hit its 20-turn budget and failed with a max-turns error.
Reviewer A finished cleanly in the same run. The primary review isn't compromised — but the meta-review's max-turns failure shows up as red in the actions tab and prevents the critique from landing.
Why 30 (not 25, not 50)
Reviewer A runs at 25 turns, and B's task (read the review, read the PR, cross-reference, draft the critique) is a superset of A's, so B should sit above 25. Jumping straight to 50 would paper over a structural prompt/budget problem rather than covering normal-PR fluctuation; 30 gives ~50% more headroom while keeping the cap informative.
Test plan
Reviewer B passes on this PR (Bot PR TDD Gate passes).
Future signal
If we see another max-turns failure at 30 on a normal-sized PR, the underlying prompt/tool budget needs a structural fix, not another bump. Tracking this to inform that decision.
🤖 Generated with Claude Code