Conversation
Reviewer B (the meta-reviewer that critiques Reviewer A for sycophancy, false positives, and missed issues) hit its 20-turn budget mid-analysis on PR #2523 — the Astro 5→6 framework upgrade. The PR's diff was substantial (~770 lockfile lines, multiple package.json updates, an astro.config.mjs migration) and the bot ran out of turns before completing its critique. Reviewer A finished cleanly in the same run, so the primary review isn't compromised. But Reviewer B's max-turns failure shows up as a red check in the actions tab and prevents the meta-review from landing on PRs at the larger end of typical scope.
Bumping to 30 gives ~50% more headroom. Reviewer A is at 25 turns; keeping Reviewer B above that reflects that meta-review of a complex PR may need slightly more turns than the original review (load: read review, read PR, cross-reference, draft critique, iterate).
If 30 also proves tight on truly massive PRs, the next lever is either splitting Reviewer B into a faster-pass mode or reducing the prompt's scope on large diffs. For now, this should cover normal-PR fluctuation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wheels Bot — Reviewer A
TL;DR: Single-line CI config bump increasing Reviewer B's --max-turns from 20 → 30. The change is correct, well-motivated, and the PR body documents the reasoning carefully. No CFML, no security surface, no conventions at stake. Verdict: comment (minor nit only — no blocking issues).
Correctness
The change does exactly what the title claims. One line, no ambiguity.
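For orientation, here is a minimal sketch of the shape of the change in .github/workflows/bot-review-b.yml. The job and step names, the action reference, and the surrounding inputs are assumptions for illustration; only the --max-turns value is what this PR actually touches.

```yaml
# Sketch of .github/workflows/bot-review-b.yml. Job/step names, action reference,
# and surrounding inputs are assumed for illustration; only --max-turns changes here.
name: Bot Review B
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  reviewer-b:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Run Reviewer B meta-review
        uses: anthropics/claude-code-action@v1      # assumed action/version
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          claude_args: "--max-turns 30"             # was "--max-turns 20" before this PR
```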
The relative ordering Reviewer B (30) > Reviewer A (25) is internally consistent: Reviewer B's task (read Reviewer A's output, read the PR, cross-reference both) is strictly a superset of Reviewer A's task. Giving B 20% more headroom than A is proportionate.
The evidence from PR #2523 (a large framework upgrade diff) is concrete — the failure message quoted in the PR body confirms this is a real budget hit, not a speculative increase.
Minor nit — timeout-minutes could be tight at 30 turns
# .github/workflows/bot-review-b.yml, line 11
timeout-minutes: 15
At 30 turns, if tool calls on a large diff average ~30 s each, the job touches the 15-minute wall. For the Astro upgrade PR that triggered this bump, the actual turn count may still land near 30. This isn't a blocker — Claude Code turns in practice average well under 30 s for this kind of task — but it's worth tracking. If a future large-PR run hits the wall instead of the turns budget, bump this to 20 minutes at the same time.
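If the wall-clock budget does become the binding constraint before the turn budget, the companion change would be a one-line bump of the job-level key (standard GitHub Actions syntax; the job name here is assumed):

```yaml
# Companion change if the 15-minute wall is ever hit first (job name assumed).
jobs:
  reviewer-b:
    runs-on: ubuntu-latest
    timeout-minutes: 20   # was 15; 30 turns x ~30 s/turn worst case = 15 min exactly
```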
Commits
ci: bump Reviewer B max-turns 20 → 30 — valid type (ci), no scope (correct per CLAUDE.md: .github/workflows/ → type ci, no scope), subject is clear and well under 100 chars.
Docs / Tests
No documentation changes required. The test plan in the PR body is appropriate for a CI config bump: the integration test is "Reviewer B passes on this PR." No unit tests apply here.
Wheels Bot — Reviewer B (round 1)
A's review is accurate and proportionate for a single-line CI config change. The reasoning about turn budget, relative ordering (B at 30 > A at 25), and commit validity is correct. One minor citation error is worth flagging.
Sycophancy
None detected. A's positive assessment is evidence-backed (quoted failure from PR #2523, internal consistency argument). Verdict is "comment" rather than "approve" — appropriate for a CI workflow change.
False positives
Missed issues
None detected. The diff is a single line. No CFML, no security surface, no migrations, no docs impact.
Verdict alignment
"Comment" with a minor nit and no blocking issues is the correct verdict for a one-line CI config bump.
…(diagnostic) (#2531)

* ci: bump propose-fix max-turns 60 → 100 (diagnostic)

Three propose-fix attempts on the registry-list bug have hit the 60-turn ceiling identically:

| Run          | Turns | Cost  | Denials |
|--------------|-------|-------|---------|
| #25616307210 | 61    | $3.66 | 15      |
| #25616508963 | 61    | $3.36 | 19      |
| #25617692327 | 61    | $3.63 | 18      |

The PR #2528 step-9 scope cut (4 doc locations → 1 CHANGELOG entry) did not move propose-fix off the cliff — the doc updates were not the dominant scope sink. The persistent 15-19 permission denials per run suggest Opus is wasting ~30% of the budget on tool calls outside the allowlist.

This is a diagnostic bump: lift the ceiling so a real-world propose-fix can complete, then observe how many turns it actually takes. The distance between the new ceiling and the actual run length tells us whether the structural problem is budget (gap is small, ceiling was just too tight) or prompt-vs-allowlist mismatch (gap is large, denials are the real waste). Comparable to PR #2524's Reviewer B 20 → 30 bump pattern.

If the follow-up evidence shows ~30% of turns going to denials, a future PR will tighten the prompt or expand the allowlist accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(diagnostic): show_full_output on propose-fix to expose denied tools

The 60→100 turn bump unblocks the cliff but doesn't tell us why ~30% of every run goes to permission denials. The action's documentation explicitly points to show_full_output: true as the way to expose per-turn tool calls, including which were rejected. Adding the flag temporarily so the next propose-fix run logs every tool Opus tried plus its outcome.

Once we know which tools are denied (likely candidates: Task, WebFetch, broader Bash commands), a follow-up PR removes this flag and either expands the allowlist or tightens the prompt to stop reaching for them.

Security note: show_full_output exposes tool-call detail to the public Actions log. The action redacts secrets (anthropic_api_key, github_token) at a separate layer regardless of this flag, so credentials don't leak. Tool names and file paths Opus reads will be visible — that's the diagnostic content we want.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
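For reference, a sketch of what the diagnostic flag addition might look like in the propose-fix workflow. show_full_output, anthropic_api_key, and github_token are the input names taken from the commit message above; the job and step names, the action reference, and the claude_args wiring are assumptions.

```yaml
# Sketch of .github/workflows/bot-propose-fix.yml with the diagnostic flag added.
# Job/step names and the action reference are assumed; show_full_output and the
# secrets inputs are the ones named in the commit message above.
jobs:
  propose-fix:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run propose-fix (Opus)
        uses: anthropics/claude-code-action@v1      # assumed action/version
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          show_full_output: true                    # temporary: log every tool call,
                                                    # including permission denials
          claude_args: "--max-turns 100"            # diagnostic ceiling, was 60
```

The flag is additive, so the follow-up PR that removes it once the denial data is in hand would be a clean one-line revert.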
The asymmetry between failure modes argues for big headroom, not
tightly tuned caps:
- Hitting the cap = SDK terminates with error_max_turns, WIP commits
may land but no PR opens, no comment posted, marker state needs
manual unwinding before re-dispatch. Recovery cost includes the full
failed-run spend + cleanup time + 2nd-run spend.
- Going long = bot completes its work, PR opens, work is preserved.
Cost is just the extra turns of model time.
Both failure modes have similar dollar cost per turn. Only one preserves
work. The kill switch (WHEELS_BOT_ENABLED=false) is the actual safety
net for genuine runaway processes — the per-stage turn cap should be
sized for "the longest legitimate run", not "average plus margin".
Empirical evidence from tonight:
bot-triage: observed 10-15 turns successful (cap was 20, 33% margin)
bot-write-docs: observed 31-32 turns (cap was 30 — 1-2 turns over the
cliff; one run "succeeded" by SDK overshoot, one
failed at error_max_turns one turn earlier)
bot-propose-fix: observed ~75 turns successful after the /tmp prompt
fix (cap was 100, 33% margin)
bot-research: untested in production yet
bot-review-a: observed 10-20 turns (cap was 25, ~25% margin)
bot-review-b: bumped from 20 to 30 in PR #2524 already
Apply ~2x the observed-typical bump across the board:
bot-triage: 20 → 40 (Sonnet)
bot-research: 40 → 80 (Opus)
bot-propose-fix: 100 → 150 (Opus)
bot-write-docs: 30 → 100 (Sonnet)
bot-review-a: 25 → 60 (Sonnet)
bot-review-b: 30 → 60 (Sonnet)
Worst-case incremental cost if every stage on a single issue hits the
new cap: ~$13 over the previous worst-case (~$15 → ~$28). Realistic
case: most stages use the same number of turns they've been using;
the bumps are pure headroom against stochastic variation.
Subsumes the original 30 → 50 single-stage write-docs bump that was
the initial scope of this PR. The audit-wide bumps replace it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
One-line bump of Reviewer B's --max-turns from 20 → 30.
Why
On PR #2523 (the Astro 5→6 framework upgrade), Reviewer B hit its 20-turn budget and failed with a max-turns error.
Reviewer A finished cleanly in the same run. The primary review isn't compromised — but the meta-review's max-turns failure shows up as red in the actions tab and prevents the critique from landing.
Why 30 (not 25, not 50)
Reviewer A runs at 25 turns, and B's task (read the review, read the PR, cross-reference, draft the critique) is a superset of A's, so B should sit above 25. Jumping straight to 50 would paper over a structural prompt/budget problem rather than covering normal-PR fluctuation; 30 gives ~50% more headroom while keeping the cap informative.
Test plan
Reviewer B passes on this PR (Bot PR TDD Gate passes).
Future signal
If we see another max-turns failure at 30 on a normal-sized PR, the underlying prompt/tool budget needs a structural fix, not another bump. Tracking this to inform that decision.
🤖 Generated with Claude Code