
ci: bump Reviewer B max-turns 20 → 30 (#2524)

Merged
bpamiri merged 1 commit into develop from claude/bump-reviewer-b-budget on May 9, 2026

Conversation


@bpamiri bpamiri commented May 9, 2026

Summary

One-line bump of Reviewer B's --max-turns from 20 → 30.
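The change itself amounts to a single flag bump. A sketch (the workflow path comes from Reviewer A's comment below; the surrounding invocation is assumed, not quoted from the diff):

```diff
 # .github/workflows/bot-review-b.yml (path per Reviewer A's citation)
-          --max-turns 20
+          --max-turns 30
```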

Why

On PR #2523 (the Astro 5→6 framework upgrade), Reviewer B failed with:

error: Claude Code returned an error result: Reached maximum number of turns (20)

Reviewer A finished cleanly in the same run. The primary review isn't compromised — but the meta-review's max-turns failure shows up as a red check in the Actions tab and prevents the critique from landing.

Why 30 (not 25, not 50)

  • Why not 25? Reviewer A is at 25 turns. Reviewer B's task is strictly more than Reviewer A's (it has to read Reviewer A's review, then read the PR, then cross-reference). It should have headroom over Reviewer A.
  • Why not 50? That's a 2.5× bump and is wasteful for normal-sized PRs. 30 gives ~50% more headroom over the observed failure point and keeps cost growth proportional to the actual issue.
  • Why not just bump on demand? The cost of a budget hit is one missed meta-review plus a red check that demands manual judgment. The cost of an extra turn is fractional cents on Sonnet 4.6. Easy win-vs-cost trade.

Test plan

  • CI passes (no-op for human PR — Bot PR TDD Gate passes)
  • Reviewer B passes on this PR (small diff, will fit comfortably in 30 turns)
  • Future large-PR Reviewer B runs (e.g., dependency bump PRs) don't hit max-turns at 30

Future signal

If we see another max-turns failure at 30 on a normal-sized PR, the underlying prompt/tool budget needs a structural fix, not another bump. Tracking this to inform that decision.

🤖 Generated with Claude Code

Reviewer B (the meta-reviewer that critiques Reviewer A for sycophancy,
false positives, and missed issues) hit its 20-turn budget mid-analysis
on PR #2523 — the Astro 5→6 framework upgrade. The PR's diff was
substantial (~770 lockfile lines, multiple package.json updates, an
astro.config.mjs migration) and the bot ran out of turns before
completing its critique.

Reviewer A finished cleanly in the same run, so the primary review
isn't compromised. But Reviewer B's max-turns failure shows up as a
red check in the Actions tab and prevents the meta-review from
landing on PRs at the larger end of typical scope.

Bumping to 30 gives ~50% more headroom. Reviewer A is at 25 turns;
keeping Reviewer B above that reflects that meta-review of a complex
PR may need slightly more turns than the original review (load: read
review, read PR, cross-reference, draft critique, iterate).

If 30 also proves tight on truly massive PRs, the next lever is
either splitting Reviewer B into a faster-pass mode or reducing the
prompt's scope on large diffs. For now, this should cover normal-PR
fluctuation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@wheels-bot wheels-bot Bot left a comment


Wheels Bot — Reviewer A

TL;DR: Single-line CI config bump increasing Reviewer B's --max-turns from 20 → 30. The change is correct, well-motivated, and the PR body documents the reasoning carefully. No CFML, no security surface, no conventions at stake. Verdict: comment (minor nit only — no blocking issues).


Correctness

The change does exactly what the title claims. One line, no ambiguity.

The relative ordering Reviewer B (30) > Reviewer A (25) is internally consistent: Reviewer B's task (read Reviewer A's output, read the PR, cross-reference both) is strictly a superset of Reviewer A's task. Giving B 20% more headroom than A is proportionate.

The evidence from PR #2523 (a large framework upgrade diff) is concrete — the failure message quoted in the PR body confirms this is a real budget hit, not a speculative increase.

Minor nit — timeout-minutes could be tight at 30 turns

# .github/workflows/bot-review-b.yml, line 11
timeout-minutes: 15

At 30 turns, if tool calls on a large diff average ~30 s each, the job touches the 15-minute wall. For the Astro upgrade PR that triggered this bump, the actual turn count may still land near 30. This isn't a blocker — Claude Code turns in practice average well under 30 s for this kind of task — but it's worth tracking. If a future large-PR run hits the wall instead of the turns budget, bump this to 20 minutes at the same time.
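The arithmetic behind that nit checks out as a back-of-envelope calculation (the ~30 s per-turn figure is the review's own pessimistic assumption, not a measured value):

```python
# Sanity check: at the assumed per-turn duration, 30 turns lands
# exactly on the 15-minute job timeout — zero slack.
turns = 30
secs_per_turn = 30                     # pessimistic assumption from the nit above
minutes_needed = turns * secs_per_turn / 60
# minutes_needed == 15.0, i.e. exactly at the timeout-minutes: 15 wall,
# which is why bumping to 20 minutes alongside a future cap increase is suggested.
```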


Commits

ci: bump Reviewer B max-turns 20 → 30 — valid type (ci), no scope (correct per CLAUDE.md: .github/workflows/ → type ci, no scope), subject is clear and well under 100 chars.


Docs / Tests

No documentation changes required. The test plan in the PR body is appropriate for a CI config bump: the integration test is "Reviewer B passes on this PR." No unit tests apply here.


wheels-bot Bot commented May 9, 2026

Wheels Bot — Reviewer B (round 1)

A's review is accurate and proportionate for a single-line CI config change. The reasoning about the turn budget, the relative ordering (B at 30 > A at 25), and commit validity is correct throughout. One minor citation error is worth flagging.

Sycophancy

None detected. A's positive assessment is evidence-backed (quoted failure from PR #2523, internal consistency argument). Verdict is "comment" rather than "approve" — appropriate for a CI workflow change.

False positives

  • A cites timeout-minutes: 15 as "line 11" in the workflow file. The actual location is line 18; line 11 is the group: key under concurrency:. The underlying observation (15-minute wall could be tight at 30 turns) is valid, but the line reference is wrong. A reader following that citation would land on the wrong block.

Missed issues

None detected. The diff is a single line. No CFML, no security surface, no migrations, no docs impact.

Verdict alignment

"Comment" with a minor nit and no blocking issues is the correct verdict for a one-line CI config bump.

@bpamiri bpamiri merged commit b9293e2 into develop May 9, 2026
6 checks passed
@bpamiri bpamiri deleted the claude/bump-reviewer-b-budget branch May 9, 2026 22:25
bpamiri added a commit that referenced this pull request May 10, 2026
…(diagnostic) (#2531)

* ci: bump propose-fix max-turns 60 → 100 (diagnostic)

Three propose-fix attempts on the registry-list bug have hit the
60-turn ceiling identically:

| Run                | Turns | Cost    | Denials |
|--------------------|-------|---------|---------|
| #25616307210       | 61    | $3.66   | 15      |
| #25616508963       | 61    | $3.36   | 19      |
| #25617692327       | 61    | $3.63   | 18      |

The PR #2528 step-9 scope cut (4 doc locations → 1 CHANGELOG entry)
did not move propose-fix off the cliff — the doc updates were not
the dominant scope sink. The persistent 15-19 permission denials per
run suggest Opus is wasting ~30% of the budget on tool calls outside
the allowlist.

This is a diagnostic bump: lift the ceiling so a real-world propose-fix
can complete, then observe how many turns it actually takes. The
distance between the new ceiling and the actual run length tells us
whether the structural problem is budget (gap is small, ceiling was
just too tight) or prompt-vs-allowlist mismatch (gap is large, denials
are the real waste).
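The ceiling-vs-actual diagnostic described above can be sketched as a hypothetical decision rule (the gap threshold here is an illustration only; the commit does not specify a number):

```python
def diagnose(ceiling: int, actual_turns: int) -> str:
    """Hypothetical post-run triage: was the old cap simply too tight,
    or is the prompt burning turns on disallowed tools?
    The 15-turn threshold is chosen for illustration."""
    gap = ceiling - actual_turns
    if gap < 15:        # small gap: the ceiling itself was the constraint
        return "budget"
    return "prompt-vs-allowlist mismatch"
```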

Comparable to PR #2524's Reviewer B 20 → 30 bump pattern. If the
follow-up evidence shows ~30% of turns going to denials, a future PR
will tighten the prompt or expand the allowlist accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(diagnostic): show_full_output on propose-fix to expose denied tools

The 60→100 turn bump unblocks the cliff but doesn't tell us why ~30%
of every run goes to permission denials. The action's documentation
explicitly points to show_full_output: true as the way to expose
per-turn tool calls including which were rejected.

Adding the flag temporarily so the next propose-fix run logs every
tool Opus tried plus its outcome. Once we know which tools are
denied (likely candidates: Task, WebFetch, broader Bash commands),
a follow-up PR removes this flag and either expands the allowlist
or tightens the prompt to stop reaching for them.

Security note: show_full_output exposes tool-call detail to the
public Actions log. The action redacts secrets (anthropic_api_key,
github_token) at a separate layer regardless of this flag, so
credentials don't leak. Tool names and file paths Opus reads will
be visible — that's the diagnostic content we want.
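Concretely, the diagnostic is one extra input on the propose-fix step. A sketch (the step name and action reference are placeholders; only the show_full_output key comes from the commit message):

```yaml
      - name: Run propose-fix              # step name assumed
        uses: <claude-code-action-ref>     # placeholder, not the real ref
        with:
          show_full_output: true           # temporary diagnostic; remove in follow-up PR
```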

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bpamiri added a commit that referenced this pull request May 10, 2026
The asymmetry between failure modes argues for big headroom, not
tightly tuned caps:

- Hitting the cap = SDK terminates with error_max_turns, WIP commits
  may land but no PR opens, no comment posted, marker state needs
  manual unwinding before re-dispatch. Recovery cost includes the full
  failed-run spend + cleanup time + 2nd-run spend.
- Going long = bot completes its work, PR opens, work is preserved.
  Cost is just the extra turns of model time.

Both failure modes have similar dollar cost per turn. Only one preserves
work. The kill switch (WHEELS_BOT_ENABLED=false) is the actual safety
net for genuine runaway processes — the per-stage turn cap should be
sized for "the longest legitimate run", not "average plus margin".

Empirical evidence from tonight:

  bot-triage:      observed 10-15 turns successful (cap was 20, 33% margin)
  bot-write-docs:  observed 31-32 turns (cap was 30 — 1-2 turns over the
                    cliff; one run "succeeded" by SDK overshoot, one
                    failed at error_max_turns one turn earlier)
  bot-propose-fix: observed ~75 turns successful after the /tmp prompt
                    fix (cap was 100, 33% margin)
  bot-research:    untested in production yet
  bot-review-a:    observed 10-20 turns (cap was 25, ~25% margin)
  bot-review-b:    bumped from 20 to 30 in PR #2524 already

Apply ~2x the observed-typical bump across the board:

  bot-triage:       20 →  40   (Sonnet)
  bot-research:     40 →  80   (Opus)
  bot-propose-fix: 100 → 150   (Opus)
  bot-write-docs:   30 → 100   (Sonnet)
  bot-review-a:     25 →  60   (Sonnet)
  bot-review-b:     30 →  60   (Sonnet)

Worst-case incremental cost if every stage on a single issue hits the
new cap: ~$13 over the previous worst-case (~$15 → ~$28). Realistic
case: most stages use the same number of turns they've been using;
the bumps are pure headroom against stochastic variation.
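As a sanity check on the aggregate effect, the caps in the table above roughly double in total. Only turn counts are summed here; per-turn dollar cost differs by model (Sonnet vs Opus), which is why the worst-case dollar figure is not an exact 2×:

```python
# Turn caps per stage, before and after the audit-wide bump (values from the table above).
caps_before = {"bot-triage": 20, "bot-research": 40, "bot-propose-fix": 100,
               "bot-write-docs": 30, "bot-review-a": 25, "bot-review-b": 30}
caps_after  = {"bot-triage": 40, "bot-research": 80, "bot-propose-fix": 150,
               "bot-write-docs": 100, "bot-review-a": 60, "bot-review-b": 60}
total_before = sum(caps_before.values())   # 245 turns worst case
total_after  = sum(caps_after.values())    # 490 turns worst case — a 2x ceiling
```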

Subsumes the original 30 → 50 single-stage write-docs bump that was
the initial scope of this PR. The audit-wide bumps replace it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
