ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling#2538
ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling#2538
Conversation
There was a problem hiding this comment.
Wheels Bot — Reviewer A
TL;DR: This PR bumps --max-turns in bot-write-docs.yml from 30 to 50 after two back-to-back runs on issue #2536 showed the budget was the binding constraint. The diff is a single line, the justification is data-driven, and the change is safe to merge as-is. Verdict: approve.
Correctness
The change is exactly what it claims. The two run records in the PR body (32 turns / success, 31 turns / error_max_turns) directly substantiate the "budget pressure, not waste" claim. The new 50-turn ceiling gives approximately 50% headroom over the observed worst case without crossing any new risk boundary.
One proportionality check:
| Workflow | max-turns | timeout-minutes | ratio |
|---|---|---|---|
| bot-propose-fix | 100 | 60 | 1.67/min |
| bot-write-docs (after) | 50 | 30 | 1.67/min |
| bot-review-b | 30 | 15 | 2.00/min |
The job timeout-minutes: 30 (line 27) is proportional to the new budget — same ratio as propose-fix. No risk of wall-clock timeout materialising before the turn cap fires.
The "Push branch" step (line 106) already guards against no-op pushes; the allowed-tools list is unchanged. No new surface exposure.
Commits
ci: bump bot-write-docs max-turns 30 → 50 — valid type (ci), no scope (correct: CI files have no matching scope in the allowlist), subject 40 chars, not ALL-CAPS.
Docs
No behaviour visible to users or framework consumers changed. No .ai/, CLAUDE.md, or CHANGELOG.md update is warranted for a CI budget bump.
c4e1ff0 to
32d5e92
Compare
Wheels Bot — Reviewer B (round 1)A's review is accurate and proportionate. The single-line bump is supported by run-level evidence (32 turns / success, 31 turns / error_max_turns), the proportionality table is arithmetically correct (timeout-minutes: 30 confirmed at line 27, 1.67 turns/min matches propose-fix), and the commit message is valid per commitlint.config.js. SycophancyNone detected. The approval is grounded in the data presented in the PR body, not reflexive praise. False positivesNone detected. Every factual claim A made was verified against the diff and the live workflow file:
Missed issuesNone detected. One minor observation A could have noted but did not need to flag: the PR body mentions the first run overshot the 30-turn cap by ~2 turns on its success path, so a 50-turn cap could in theory also overshoot by ~2. That does not affect correctness of the fix and does not warrant a change request. Verdict alignmentAPPROVED is consistent with a single-line, data-backed CI budget bump with no security, migration, deploy, or DI surface exposure. |
There was a problem hiding this comment.
Wheels Bot — Reviewer A
TL;DR: This PR raises --max-turns across all six bot-stage workflows to give headroom against stochastic error_max_turns failures, with strong data-driven justification for the immediate trigger (write-docs, #2536). The direction is correct and the asymmetry argument is sound. One correctness concern warrants attention before merge: the timeout-minutes ceilings on several stages have not been adjusted proportionally, meaning that for bot-review-b and bot-review-a in particular, the wall-clock timeout now fires before the turn cap can if a run genuinely needs many turns. Verdict: comment (correctness concern — not blocking for low-probability cases, but worth resolving for internal consistency).
Correctness
Timeout / turn-cap misalignment (.github/workflows/bot-review-a.yml line 19, bot-review-b.yml line 18, bot-write-docs.yml line 27)
The PR raises --max-turns but does not touch timeout-minutes. This creates a logical inconsistency for three stages: if a run legitimately hits the new cap, the wall-clock timeout fires first — and unlike error_max_turns, a GitHub Actions wall-clock kill does not allow Claude to write its idempotency marker before exiting. That leaves the issue/PR without a marker, which can cause spurious re-triggers on the next qualifying event.
| Stage | timeout-minutes | New max-turns | Time at 20 s/turn | Fits? |
|---|---|---|---|---|
| bot-triage | 25 | 40 | ~13 min | Yes |
| bot-research | 30 | 80 | ~27 min | Marginal (Opus turns can be slower) |
| bot-propose-fix | 60 | 150 | ~50 min | Yes |
| bot-write-docs | 30 | 100 | ~33 min | No — 3 min over |
| bot-review-a | 20 | 60 | ~20 min | Borderline — exactly at limit |
| bot-review-b | 15 | 60 | ~20 min | No — 5 min over |
bot-review-b is the most at-risk: 60 turns at a conservative 20 s/turn exceeds the 15-minute budget by a third. bot-review-a sits right at its ceiling with zero slack. bot-write-docs at 100 turns and 30 minutes is fine in practice (observed runs use ~32 turns), but the logical maximum is incoherent.
Suggested fix — bump timeout-minutes to match:
# bot-review-a.yml
timeout-minutes: 35 # 60 turns x ~30 s/turn (generous) + startup overhead
# bot-review-b.yml
timeout-minutes: 35 # same reasoning; currently 15 is clearly too tight for new cap
# bot-write-docs.yml
timeout-minutes: 60 # 100 turns x ~30 s/turn; matches propose-fix precedentAlternatively, document explicitly in each job comment that the wall-clock timeout is the intended hard cap for pathological runs and accept that marker state may be left inconsistent in those cases. The current PR body does not address this trade-off.
Evidence-base asymmetry (PR body, bot-research row)
The table in the PR body notes bot-research is "untested in production" yet still doubles its cap from 40 to 80. This is acceptable on the asymmetry argument, but a bot-research run using 70+ turns against a 30-minute timeout-minutes is plausible (Opus averages slower turn-around than Sonnet). Suggest bumping timeout-minutes on bot-research to at least 45 minutes if 80 turns is the new ceiling.
Commits
ci: audit bot stage turn caps for generous headroom — valid type (ci), no scope (correct; .github/workflows/ maps to no allowlist scope per CLAUDE.md § Commit Message Conventions), subject 49 chars, not ALL-CAPS. Clean.
Docs
No user-facing behaviour changed. No .ai/, CLAUDE.md, or CHANGELOG.md update is required for internal CI budget adjustments.
Tests / Security
Not applicable to CI workflow configuration files.
…ling
Premature optimization on a dimension we don't have data for. The
real ceiling is each workflow's `timeout-minutes` (5-60 min depending
on stage), and the real safety net is `WHEELS_BOT_ENABLED=false`.
Turn caps that compete with those ceilings just create a redundant
cliff, and tonight we discovered painfully that misjudging where it
sits costs a full failed run + manual cleanup + re-dispatch.
Math that motivates the lift: with `timeout-minutes: 60` and Opus at
~5-10s per turn, propose-fix can practically execute ~360-600 turns
before the timeout triggers anyway. A 100-turn cap was a cliff at
~17% of the time budget. A 1000-turn cap is unreachable — the
timeout catches it first, which is what we want.
Plan: run with these ceilings for a few weeks, collect data on
actual turn counts across a variety of issue shapes, then revisit
with empirical evidence rather than guesses.
bot-triage: 20 → 200 (Sonnet, timeout 25m)
bot-research: 40 → 400 (Opus, timeout 30m)
bot-propose-fix: 100 → 1000 (Opus, timeout 60m — effectively no cap;
workflow timeout becomes the ceiling)
bot-write-docs: 30 → 300 (Sonnet, timeout 30m)
bot-review-a: 25 → 250 (Sonnet, timeout 20m)
bot-review-b: 30 → 300 (Sonnet, timeout 15m)
Worst-case incremental cost if a stage hits its new cap before the
workflow timeout: ~$50 for propose-fix Opus at 1000 turns, similar
for research. In practice the workflow timeout will catch any
runaway long before max-turns does. The kill switch
(`WHEELS_BOT_ENABLED=false`) is the manual safety net for active
incidents.
Subsumes both the original 30 → 50 single-stage write-docs bump and
the 2x audit. The 10x lift is the right level of generosity for the
data-collection period before we tighten anything.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
32d5e92 to
ef63d80
Compare
Wheels Bot — Reviewer B (round 2)A's core finding (timeout/turn-cap misalignment for bot-review-b, bot-review-a, and bot-write-docs) is factually correct and concretely supported. The arithmetic and workflow file values are accurate. One missed issue: A applies the Opus-slowness caveat to bot-research but not to bot-propose-fix, which also runs on Opus and has the same structural risk at its new cap. SycophancyNone detected. A issued COMMENTED rather than APPROVED, backed by a specific, verifiable correctness concern with an arithmetic table. No reflexive praise. False positivesNone detected. Confirmed the actual
A's table values are accurate. The marker-state corruption risk A describes (wall-clock kill leaving no idempotency marker) is real — GitHub Actions does not give the process a graceful shutdown window when Missed issuesbot-propose-fix timeout also at risk under Opus turns — A explicitly notes "Opus averages slower turn-around than Sonnet" when flagging bot-research, then marks bot-propose-fix as "Yes / Fits" using the same 20 s/turn assumption it questioned for Opus. Bot-propose-fix also uses Opus ( Verdict alignmentCOMMENTED is within the reasonable range: the misalignment is real but the failure probability is low (it requires a run that legitimately hits the new ceiling). CHANGES_REQUESTED would also have been defensible given the marker-state corruption consequence. The verdict is consistent with A's own framing of the concern as "not blocking for low-probability cases." |
There was a problem hiding this comment.
Wheels Bot — Reviewer A
TL;DR: This PR replaces the intermediate 2× turn-cap bump (reviewed at sha 32d5e92) with a 10× bump that makes every --max-turns value effectively unreachable, shifting the active ceiling to timeout-minutes on all six stages. The design rationale is self-consistent and honest about the trade-offs. One pre-existing correctness concern — timeout-killed jobs cannot post idempotency markers — is now the only failure path for runaway jobs on the short-timeout stages, which increases its practical probability. That deserves explicit acknowledgement in the code rather than just the PR body. Verdict: comment.
Correctness
The turn-cap/timeout ratio is now coherent — this directly addresses the previous review's concern. At even an aggressive 20 s/turn:
| Stage | timeout-minutes | New max-turns | Turns possible in window | Cap reachable? |
|---|---|---|---|---|
| bot-triage | 25 | 200 | ~75 | No |
| bot-research | 30 | 400 | ~90 | No |
| bot-propose-fix | 60 | 1000 | ~180 | No |
| bot-write-docs | 30 | 300 | ~90 | No |
| bot-review-a | 20 | 250 | ~60 | No |
| bot-review-b | 15 | 300 | ~45 | No |
timeout-minutes is now unambiguously the active ceiling for every stage. The design is internally consistent.
Idempotency gap on timeout-killed jobs (.github/workflows/bot-review-b.yml line 18, .github/workflows/bot-review-a.yml line 19)
This concern was raised in the previous review and is unresolved by design choice. With turn caps effectively unreachable, a job that goes pathological will always exit via GitHub Actions wall-clock kill — not via Claude's own error_max_turns handler. The distinction matters: error_max_turns lets Claude finish its current tool call and post a final message; a wall-clock kill sends SIGTERM to the runner process and Claude cannot post the idempotency marker before exit.
The consequence: if bot-review-b (15 min timeout) gets a weird PR that drives it toward its limit, it exits without posting <!-- wheels-bot:review-b:... -->. The next qualifying event (another review push) re-triggers the stage. For bot-review-b this is low-stakes (duplicate review comment, not a code push), but for bot-write-docs and bot-propose-fix a re-trigger opens a duplicate branch.
The PR body states this is an accepted trade-off. That is a defensible position — but the trade-off currently lives only in this PR description, which will not be visible to the next person who diagnoses a double-trigger. Suggest adding a one-line comment above each affected job, e.g.:
# timeout-minutes is the active ceiling; max-turns is set high enough to be unreachable.
# A timeout-killed job exits without posting its idempotency marker — re-triggers are
# possible but expected to be rare and low-severity (duplicate comment, not duplicate push).
timeout-minutes: 15This is a docs/convention finding, not a blocker.
Cost risk surface is unchanged in practice: the PR body is correct that a propose-fix Opus run consuming its full 60-minute budget was already possible before this change (the timeout was always there). The turn cap provided imperfect protection only against the narrow band of runs that would have hit the cap before the timeout. That band was small and the protection was brittle. Removing it while explicitly embracing the timeout as the ceiling is the cleaner design.
Commits
ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling — valid type (ci), no scope (correct; .github/workflows/ has no allowlist scope per CLAUDE.md § Commit Message Conventions), subject 59 chars, not ALL-CAPS. Communicates both the what and the why. Clean.
Docs / Tests / Security
Not applicable. No user-facing behaviour changed; no framework code touched; no credentials in scope.
Wheels Bot — Reviewer B (round 3)A's review is accurate and the verdict (COMMENTED) is appropriate. One minor internal inconsistency in the turn-cap reachability table; no false positives or sycophancy. SycophancyNone detected. A issued COMMENTED backed by a concrete finding (idempotency marker gap), not reflexive praise. False positivesOne minor inconsistency: A's table uses 20 s/turn as the This is not a consequential error: A's conservative estimate is the safer direction for a reachability analysis, and in practice bot-review-b runs stay well under 50 turns. But A's "Cap reachable? No" for bot-review-b is weakly supported by the PR body's own turn-rate estimates. All other table values and factual claims are correct. Missed issuesNone detected. A caught the idempotency gap, the duplicate-branch risk for write-docs and propose-fix, and correctly validated the commit message. No CFML or framework code was touched; "not applicable" for docs/tests/security is correct. Verdict alignmentCOMMENTED is consistent with A's findings: the idempotency concern is real but pre-existing (not introduced by this PR), and the suggestion (add a YAML comment) is a docs nudge. No security, migration, deploy, or DI surface requires a human-only gate. |
Summary
Lifts
--max-turnsacross all six bot stages to ~10× the originally-shipped values. Originally scoped as a 30 → 50 single-stage fix forbot-write-docs.yml, then expanded to 2× across the board, then expanded again to ~10× after we recognized:The math
With
timeout-minutesalready enforcing a hard ceiling on each job:--max-turnsEvery old cap was a cliff at 4–28% of the time budget — well below where the timeout would kick in for a runaway. The cap was the failure mode, not the safety mechanism.
What changed
bot-triagebot-researchbot-propose-fixbot-write-docsbot-review-abot-review-bAfter this lands, the active ceiling for every stage is
timeout-minutes, not--max-turns. A genuine runaway loop can still consume the full time budget worth of model spend (~$30–50 worst case for propose-fix), but that's already the reality today — turn caps below the timeout-equivalent are pure premature optimization.The data-collection plan
Run for a few weeks with these ceilings effectively non-binding. Collect:
num_turnsdistribution per stage across a variety of issue shapes (bugs, framework-design, docs-requests; trivial vs complex)After a representative sample, we can either tighten caps based on observed P95/P99 turn counts (predictable optimization) or leave as-is if the data shows everything fits comfortably.
What this does not change
timeout-minutes— unchanged, still the actual ceilingWHEELS_BOT_ENABLED=falsekill switch — unchanged, still the active-incident safety net[skip-claude]per-issue/PR opt-out — unchangedif:blockTest plan
bot-write-docs.ymlon issue #2536 (orphan branch deleted, marker absent).🤖 Generated with Claude Code