Skip to content

ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling#2538

Merged
bpamiri merged 1 commit intodevelopfrom
peter/bot-write-docs-bump-turns
May 10, 2026
Merged

ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling#2538
bpamiri merged 1 commit intodevelopfrom
peter/bot-write-docs-bump-turns

Conversation

@bpamiri
Copy link
Copy Markdown
Collaborator

@bpamiri bpamiri commented May 10, 2026

Summary

Lifts --max-turns across all six bot stages to ~10× the originally-shipped values. Originally scoped as a 30 → 50 single-stage fix for bot-write-docs.yml, then expanded to 2× across the board, then expanded again to ~10× after we recognized:

Turn caps that compete with timeout-minutes just create a redundant cliff. Misjudging where the cliff sits costs a full failed run plus manual cleanup plus a re-dispatch. The timeout is the real ceiling. Let it be.

The math

With timeout-minutes already enforcing a hard ceiling on each job:

Stage Model timeout-minutes Practical turns possible (rough) Old --max-turns % of time budget
triage Sonnet 25 ~300–500 20 4–6%
research Opus 30 ~180–360 40 11–22%
propose-fix Opus 60 ~360–600 100 17–28%
write-docs Sonnet 30 ~360–600 30 5–8%
review-a Sonnet 20 ~240–400 25 6–10%
review-b Sonnet 15 ~180–300 30 10–17%

Every old cap was a cliff at 4–28% of the time budget — well below where the timeout would kick in for a runaway. The cap was the failure mode, not the safety mechanism.

What changed

Stage Old New Multiplier Effectively unreachable?
bot-triage 20 200 10× Yes (timeout caps at ~300–500)
bot-research 40 400 10× Yes (timeout caps at ~180–360) — likely never hit either
bot-propose-fix 100 1000 10× Yes (timeout caps at ~360–600)
bot-write-docs 30 300 10× Yes (timeout caps at ~360–600)
bot-review-a 25 250 10× Yes (timeout caps at ~240–400)
bot-review-b 30 300 10× Yes (timeout caps at ~180–300)

After this lands, the active ceiling for every stage is timeout-minutes, not --max-turns. A genuine runaway loop can still consume the full time budget worth of model spend (~$30–50 worst case for propose-fix), but that's already the reality today — turn caps below the timeout-equivalent are pure premature optimization.

The data-collection plan

Run for a few weeks with these ceilings effectively non-binding. Collect:

  • Actual num_turns distribution per stage across a variety of issue shapes (bugs, framework-design, docs-requests; trivial vs complex)
  • Cost distribution per stage
  • Cap-hit rate (should be near zero)
  • Timeout-hit rate (should be near zero unless something is genuinely broken)

After a representative sample, we can either tighten caps based on observed P95/P99 turn counts (predictable optimization) or leave as-is if the data shows everything fits comfortably.

What this does not change

  • Workflow timeout-minutes — unchanged, still the actual ceiling
  • WHEELS_BOT_ENABLED=false kill switch — unchanged, still the active-incident safety net
  • The [skip-claude] per-issue/PR opt-out — unchanged
  • Any prompt logic, allowlist, or model selection
  • The bot-identity author checks on every auto-fire if: block

Test plan

  • Merge.
  • Re-dispatch bot-write-docs.yml on issue #2536 (orphan branch deleted, marker absent).
  • Confirm: write-docs completes naturally (turns far below 300), opens its own PR, posts comment.
  • Watch the rest of the cascade run cleanly with the new headroom.
  • No follow-up bumps expected — if any stage hits one of these new caps, that's a signal worth investigating (likely a prompt bug or a genuinely massive issue).

🤖 Generated with Claude Code

wheels-bot[bot]
wheels-bot Bot previously approved these changes May 10, 2026
Copy link
Copy Markdown
Contributor

@wheels-bot wheels-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wheels Bot — Reviewer A

TL;DR: This PR bumps --max-turns in bot-write-docs.yml from 30 to 50 after two back-to-back runs on issue #2536 showed the budget was the binding constraint. The diff is a single line, the justification is data-driven, and the change is safe to merge as-is. Verdict: approve.

Correctness

The change is exactly what it claims. The two run records in the PR body (32 turns / success, 31 turns / error_max_turns) directly substantiate the "budget pressure, not waste" claim. The new 50-turn ceiling gives approximately 50% headroom over the observed worst case without crossing any new risk boundary.

One proportionality check:

Workflow max-turns timeout-minutes ratio
bot-propose-fix 100 60 1.67/min
bot-write-docs (after) 50 30 1.67/min
bot-review-b 30 15 2.00/min

The job timeout-minutes: 30 (line 27) is proportional to the new budget — same ratio as propose-fix. No risk of wall-clock timeout materialising before the turn cap fires.

The "Push branch" step (line 106) already guards against no-op pushes; the allowed-tools list is unchanged. No new surface exposure.

Commits

ci: bump bot-write-docs max-turns 30 → 50 — valid type (ci), no scope (correct: CI files have no matching scope in the allowlist), subject 40 chars, not ALL-CAPS.

Docs

No behaviour visible to users or framework consumers changed. No .ai/, CLAUDE.md, or CHANGELOG.md update is warranted for a CI budget bump.

@wheels-bot
Copy link
Copy Markdown
Contributor

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 1)

A's review is accurate and proportionate. The single-line bump is supported by run-level evidence (32 turns / success, 31 turns / error_max_turns), the proportionality table is arithmetically correct (timeout-minutes: 30 confirmed at line 27, 1.67 turns/min matches propose-fix), and the commit message is valid per commitlint.config.js.

Sycophancy

None detected. The approval is grounded in the data presented in the PR body, not reflexive praise.

False positives

None detected. Every factual claim A made was verified against the diff and the live workflow file:

  • timeout-minutes: 30 is at line 27 — confirmed.
  • Ratio table values (1.67/min for write-docs after and propose-fix) are arithmetically correct.
  • No scope on the commit is the right call — ci files have no matching scope in the commitlint allowlist.
  • No .ai/ or CLAUDE.md update needed — correct for a CI config change with no user-facing or framework-consumer impact.

Missed issues

None detected. One minor observation A could have noted but did not need to flag: the PR body mentions the first run overshot the 30-turn cap by ~2 turns on its success path, so a 50-turn cap could in theory also overshoot by ~2. That does not affect correctness of the fix and does not warrant a change request.

Verdict alignment

APPROVED is consistent with a single-line, data-backed CI budget bump with no security, migration, deploy, or DI surface exposure.

@bpamiri bpamiri changed the title ci: bump bot-write-docs max-turns 30 → 50 ci: audit bot stage turn caps for generous headroom May 10, 2026
Copy link
Copy Markdown
Contributor

@wheels-bot wheels-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wheels Bot — Reviewer A

TL;DR: This PR raises --max-turns across all six bot-stage workflows to give headroom against stochastic error_max_turns failures, with strong data-driven justification for the immediate trigger (write-docs, #2536). The direction is correct and the asymmetry argument is sound. One correctness concern warrants attention before merge: the timeout-minutes ceilings on several stages have not been adjusted proportionally, meaning that for bot-review-b and bot-review-a in particular, the wall-clock timeout now fires before the turn cap can if a run genuinely needs many turns. Verdict: comment (correctness concern — not blocking for low-probability cases, but worth resolving for internal consistency).


Correctness

Timeout / turn-cap misalignment (.github/workflows/bot-review-a.yml line 19, bot-review-b.yml line 18, bot-write-docs.yml line 27)

The PR raises --max-turns but does not touch timeout-minutes. This creates a logical inconsistency for three stages: if a run legitimately hits the new cap, the wall-clock timeout fires first — and unlike error_max_turns, a GitHub Actions wall-clock kill does not allow Claude to write its idempotency marker before exiting. That leaves the issue/PR without a marker, which can cause spurious re-triggers on the next qualifying event.

Stage timeout-minutes New max-turns Time at 20 s/turn Fits?
bot-triage 25 40 ~13 min Yes
bot-research 30 80 ~27 min Marginal (Opus turns can be slower)
bot-propose-fix 60 150 ~50 min Yes
bot-write-docs 30 100 ~33 min No — 3 min over
bot-review-a 20 60 ~20 min Borderline — exactly at limit
bot-review-b 15 60 ~20 min No — 5 min over

bot-review-b is the most at-risk: 60 turns at a conservative 20 s/turn exceeds the 15-minute budget by a third. bot-review-a sits right at its ceiling with zero slack. bot-write-docs at 100 turns and 30 minutes is fine in practice (observed runs use ~32 turns), but the logical maximum is incoherent.

Suggested fix — bump timeout-minutes to match:

# bot-review-a.yml
timeout-minutes: 35   # 60 turns x ~30 s/turn (generous) + startup overhead

# bot-review-b.yml
timeout-minutes: 35   # same reasoning; currently 15 is clearly too tight for new cap

# bot-write-docs.yml
timeout-minutes: 60   # 100 turns x ~30 s/turn; matches propose-fix precedent

Alternatively, document explicitly in each job comment that the wall-clock timeout is the intended hard cap for pathological runs and accept that marker state may be left inconsistent in those cases. The current PR body does not address this trade-off.

Evidence-base asymmetry (PR body, bot-research row)

The table in the PR body notes bot-research is "untested in production" yet still doubles its cap from 40 to 80. This is acceptable on the asymmetry argument, but a bot-research run using 70+ turns against a 30-minute timeout-minutes is plausible (Opus averages slower turn-around than Sonnet). Suggest bumping timeout-minutes on bot-research to at least 45 minutes if 80 turns is the new ceiling.


Commits

ci: audit bot stage turn caps for generous headroom — valid type (ci), no scope (correct; .github/workflows/ maps to no allowlist scope per CLAUDE.md § Commit Message Conventions), subject 49 chars, not ALL-CAPS. Clean.


Docs

No user-facing behaviour changed. No .ai/, CLAUDE.md, or CHANGELOG.md update is required for internal CI budget adjustments.


Tests / Security

Not applicable to CI workflow configuration files.

…ling

Premature optimization on a dimension we don't have data for. The
real ceiling is each workflow's `timeout-minutes` (5-60 min depending
on stage), and the real safety net is `WHEELS_BOT_ENABLED=false`.
Turn caps that compete with those ceilings just create a redundant
cliff, and tonight we discovered painfully that misjudging where it
sits costs a full failed run + manual cleanup + re-dispatch.

Math that motivates the lift: with `timeout-minutes: 60` and Opus at
~5-10s per turn, propose-fix can practically execute ~360-600 turns
before the timeout triggers anyway. A 100-turn cap was a cliff at
~17% of the time budget. A 1000-turn cap is unreachable — the
timeout catches it first, which is what we want.

Plan: run with these ceilings for a few weeks, collect data on
actual turn counts across a variety of issue shapes, then revisit
with empirical evidence rather than guesses.

  bot-triage:       20 →  200   (Sonnet, timeout 25m)
  bot-research:     40 →  400   (Opus, timeout 30m)
  bot-propose-fix: 100 → 1000   (Opus, timeout 60m — effectively no cap;
                                  workflow timeout becomes the ceiling)
  bot-write-docs:   30 →  300   (Sonnet, timeout 30m)
  bot-review-a:     25 →  250   (Sonnet, timeout 20m)
  bot-review-b:     30 →  300   (Sonnet, timeout 15m)

Worst-case incremental cost if a stage hits its new cap before the
workflow timeout: ~$50 for propose-fix Opus at 1000 turns, similar
for research. In practice the workflow timeout will catch any
runaway long before max-turns does. The kill switch
(`WHEELS_BOT_ENABLED=false`) is the manual safety net for active
incidents.

Subsumes both the original 30 → 50 single-stage write-docs bump and
the 2x audit. The 10x lift is the right level of generosity for the
data-collection period before we tighten anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bpamiri bpamiri force-pushed the peter/bot-write-docs-bump-turns branch from 32d5e92 to ef63d80 Compare May 10, 2026 05:08
@bpamiri bpamiri changed the title ci: audit bot stage turn caps for generous headroom ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling May 10, 2026
@wheels-bot
Copy link
Copy Markdown
Contributor

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 2)

A's core finding (timeout/turn-cap misalignment for bot-review-b, bot-review-a, and bot-write-docs) is factually correct and concretely supported. The arithmetic and workflow file values are accurate. One missed issue: A applies the Opus-slowness caveat to bot-research but not to bot-propose-fix, which also runs on Opus and has the same structural risk at its new cap.

Sycophancy

None detected. A issued COMMENTED rather than APPROVED, backed by a specific, verifiable correctness concern with an arithmetic table. No reflexive praise.

False positives

None detected. Confirmed the actual timeout-minutes values via the workflow files:

Stage timeout-minutes (actual) New max-turns
bot-review-b 15 60
bot-review-a 20 60
bot-write-docs 30 100
bot-research 30 80
bot-propose-fix 60 150
bot-triage 25 40

A's table values are accurate. The marker-state corruption risk A describes (wall-clock kill leaving no idempotency marker) is real — GitHub Actions does not give the process a graceful shutdown window when timeout-minutes fires.

Missed issues

bot-propose-fix timeout also at risk under Opus turns — A explicitly notes "Opus averages slower turn-around than Sonnet" when flagging bot-research, then marks bot-propose-fix as "Yes / Fits" using the same 20 s/turn assumption it questioned for Opus. Bot-propose-fix also uses Opus (--model claude-opus-4-7). At 30 s/turn — the same Opus-adjusted rate A implies for research — 150 turns x 30 s = 75 min, which exceeds the 60-minute timeout by the same relative margin A found for bot-research. A should have flagged propose-fix consistently with its own caveat, or defended why propose-fix is exempt (e.g., observed runs stay well under 100 turns, so the pathological ceiling is less plausible).

Verdict alignment

COMMENTED is within the reasonable range: the misalignment is real but the failure probability is low (it requires a run that legitimately hits the new ceiling). CHANGES_REQUESTED would also have been defensible given the marker-state corruption consequence. The verdict is consistent with A's own framing of the concern as "not blocking for low-probability cases."

@bpamiri bpamiri merged commit 8ce5bc9 into develop May 10, 2026
5 checks passed
@bpamiri bpamiri deleted the peter/bot-write-docs-bump-turns branch May 10, 2026 05:09
Copy link
Copy Markdown
Contributor

@wheels-bot wheels-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wheels Bot — Reviewer A

TL;DR: This PR replaces the intermediate 2× turn-cap bump (reviewed at sha 32d5e92) with a 10× bump that makes every --max-turns value effectively unreachable, shifting the active ceiling to timeout-minutes on all six stages. The design rationale is self-consistent and honest about the trade-offs. One pre-existing correctness concern — timeout-killed jobs cannot post idempotency markers — is now the only failure path for runaway jobs on the short-timeout stages, which increases its practical probability. That deserves explicit acknowledgement in the code rather than just the PR body. Verdict: comment.


Correctness

The turn-cap/timeout ratio is now coherent — this directly addresses the previous review's concern. At even an aggressive 20 s/turn:

Stage timeout-minutes New max-turns Turns possible in window Cap reachable?
bot-triage 25 200 ~75 No
bot-research 30 400 ~90 No
bot-propose-fix 60 1000 ~180 No
bot-write-docs 30 300 ~90 No
bot-review-a 20 250 ~60 No
bot-review-b 15 300 ~45 No

timeout-minutes is now unambiguously the active ceiling for every stage. The design is internally consistent.

Idempotency gap on timeout-killed jobs (.github/workflows/bot-review-b.yml line 18, .github/workflows/bot-review-a.yml line 19)

This concern was raised in the previous review and is unresolved by design choice. With turn caps effectively unreachable, a job that goes pathological will always exit via GitHub Actions wall-clock kill — not via Claude's own error_max_turns handler. The distinction matters: error_max_turns lets Claude finish its current tool call and post a final message; a wall-clock kill sends SIGTERM to the runner process and Claude cannot post the idempotency marker before exit.

The consequence: if bot-review-b (15 min timeout) gets a weird PR that drives it toward its limit, it exits without posting <!-- wheels-bot:review-b:... -->. The next qualifying event (another review push) re-triggers the stage. For bot-review-b this is low-stakes (duplicate review comment, not a code push), but for bot-write-docs and bot-propose-fix a re-trigger opens a duplicate branch.

The PR body states this is an accepted trade-off. That is a defensible position — but the trade-off currently lives only in this PR description, which will not be visible to the next person who diagnoses a double-trigger. Suggest adding a one-line comment above each affected job, e.g.:

# timeout-minutes is the active ceiling; max-turns is set high enough to be unreachable.
# A timeout-killed job exits without posting its idempotency marker — re-triggers are
# possible but expected to be rare and low-severity (duplicate comment, not duplicate push).
timeout-minutes: 15

This is a docs/convention finding, not a blocker.

Cost risk surface is unchanged in practice: the PR body is correct that a propose-fix Opus run consuming its full 60-minute budget was already possible before this change (the timeout was always there). The turn cap provided imperfect protection only against the narrow band of runs that would have hit the cap before the timeout. That band was small and the protection was brittle. Removing it while explicitly embracing the timeout as the ceiling is the cleaner design.


Commits

ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling — valid type (ci), no scope (correct; .github/workflows/ has no allowlist scope per CLAUDE.md § Commit Message Conventions), subject 59 chars, not ALL-CAPS. Communicates both the what and the why. Clean.


Docs / Tests / Security

Not applicable. No user-facing behaviour changed; no framework code touched; no credentials in scope.

@wheels-bot
Copy link
Copy Markdown
Contributor

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 3)

A's review is accurate and the verdict (COMMENTED) is appropriate. One minor internal inconsistency in the turn-cap reachability table; no false positives or sycophancy.

Sycophancy

None detected. A issued COMMENTED backed by a concrete finding (idempotency marker gap), not reflexive praise.

False positives

One minor inconsistency: A's table uses 20 s/turn as the aggressive estimate, giving bot-review-b 45 turns possible in its 15-minute window — comfortably below the new 300-turn cap. But the PR body's own math column for bot-review-b says "~180-300" practical turns in that same 15-minute window, implying ~3-5 s/turn for Sonnet. At the PR body's upper bound (3 s/turn), 300 turns fit in exactly 15 minutes — meaning the cap is borderline reachable, not firmly unreachable as A's table states.

This is not a consequential error: A's conservative estimate is the safer direction for a reachability analysis, and in practice bot-review-b runs stay well under 50 turns. But A's "Cap reachable? No" for bot-review-b is weakly supported by the PR body's own turn-rate estimates.

All other table values and factual claims are correct.

Missed issues

None detected. A caught the idempotency gap, the duplicate-branch risk for write-docs and propose-fix, and correctly validated the commit message. No CFML or framework code was touched; "not applicable" for docs/tests/security is correct.

Verdict alignment

COMMENTED is consistent with A's findings: the idempotency concern is real but pre-existing (not introduced by this PR), and the suggestion (add a YAML comment) is a docs nudge. No security, migration, deploy, or DI surface requires a human-only gate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant