ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling by bpamiri · Pull Request #2538 · wheels-dev/wheels

bpamiri · 2026-05-10T04:58:02Z

Summary

Lifts --max-turns across all six bot stages to ~10× the originally-shipped values. Originally scoped as a 30 → 50 single-stage fix for bot-write-docs.yml, then expanded to 2× across the board, then expanded again to ~10× after we recognized:

Turn caps that compete with timeout-minutes just create a redundant cliff. Misjudging where the cliff sits costs a full failed run plus manual cleanup plus a re-dispatch. The timeout is the real ceiling. Let it be.

The math

With timeout-minutes already enforcing a hard ceiling on each job:

Stage	Model	timeout-minutes	Practical turns possible (rough)	Old `--max-turns`	% of time budget
triage	Sonnet	25	~300–500	20	4–6%
research	Opus	30	~180–360	40	11–22%
propose-fix	Opus	60	~360–600	100	17–28%
write-docs	Sonnet	30	~360–600	30	5–8%
review-a	Sonnet	20	~240–400	25	6–10%
review-b	Sonnet	15	~180–300	30	10–17%

Every old cap was a cliff at 4–28% of the time budget — well below where the timeout would kick in for a runaway. The cap was the failure mode, not the safety mechanism.

What changed

Stage	Old	New	Multiplier	Effectively unreachable?
`bot-triage`	20	200	10×	Yes (timeout caps at ~300–500)
`bot-research`	40	400	10×	Yes (timeout caps at ~180–360) — likely never hit either
`bot-propose-fix`	100	1000	10×	Yes (timeout caps at ~360–600)
`bot-write-docs`	30	300	10×	Yes (timeout caps at ~360–600)
`bot-review-a`	25	250	10×	Yes (timeout caps at ~240–400)
`bot-review-b`	30	300	10×	Yes (timeout caps at ~180–300)

After this lands, the active ceiling for every stage is timeout-minutes, not --max-turns. A genuine runaway loop can still consume the full time budget worth of model spend (~$30–50 worst case for propose-fix), but that's already the reality today — turn caps below the timeout-equivalent are pure premature optimization.

The data-collection plan

Run for a few weeks with these ceilings effectively non-binding. Collect:

Actual num_turns distribution per stage across a variety of issue shapes (bugs, framework-design, docs-requests; trivial vs complex)
Cost distribution per stage
Cap-hit rate (should be near zero)
Timeout-hit rate (should be near zero unless something is genuinely broken)

After a representative sample, we can either tighten caps based on observed P95/P99 turn counts (predictable optimization) or leave as-is if the data shows everything fits comfortably.

What this does not change

Workflow timeout-minutes — unchanged, still the actual ceiling
WHEELS_BOT_ENABLED=false kill switch — unchanged, still the active-incident safety net
The [skip-claude] per-issue/PR opt-out — unchanged
Any prompt logic, allowlist, or model selection
The bot-identity author checks on every auto-fire if: block

Test plan

Merge.
Re-dispatch bot-write-docs.yml on issue #2536 (orphan branch deleted, marker absent).
Confirm: write-docs completes naturally (turns far below 300), opens its own PR, posts comment.
Watch the rest of the cascade run cleanly with the new headroom.
No follow-up bumps expected — if any stage hits one of these new caps, that's a signal worth investigating (likely a prompt bug or a genuinely massive issue).

🤖 Generated with Claude Code

wheels-bot

Wheels Bot — Reviewer A

TL;DR: This PR bumps --max-turns in bot-write-docs.yml from 30 to 50 after two back-to-back runs on issue #2536 showed the budget was the binding constraint. The diff is a single line, the justification is data-driven, and the change is safe to merge as-is. Verdict: approve.

Correctness

The change is exactly what it claims. The two run records in the PR body (32 turns / success, 31 turns / error_max_turns) directly substantiate the "budget pressure, not waste" claim. The new 50-turn ceiling gives approximately 50% headroom over the observed worst case without crossing any new risk boundary.

One proportionality check:

Workflow	max-turns	timeout-minutes	ratio
bot-propose-fix	100	60	1.67/min
bot-write-docs (after)	50	30	1.67/min
bot-review-b	30	15	2.00/min

The job timeout-minutes: 30 (line 27) is proportional to the new budget — same ratio as propose-fix. No risk of wall-clock timeout materialising before the turn cap fires.

The "Push branch" step (line 106) already guards against no-op pushes; the allowed-tools list is unchanged. No new surface exposure.

Commits

ci: bump bot-write-docs max-turns 30 → 50 — valid type (ci), no scope (correct: CI files have no matching scope in the allowlist), subject 40 chars, not ALL-CAPS.

Docs

No behaviour visible to users or framework consumers changed. No .ai/, CLAUDE.md, or CHANGELOG.md update is warranted for a CI budget bump.

wheels-bot · 2026-05-10T05:02:30Z

Wheels Bot — Reviewer B (round 1)

A's review is accurate and proportionate. The single-line bump is supported by run-level evidence (32 turns / success, 31 turns / error_max_turns), the proportionality table is arithmetically correct (timeout-minutes: 30 confirmed at line 27, 1.67 turns/min matches propose-fix), and the commit message is valid per commitlint.config.js.

Sycophancy

None detected. The approval is grounded in the data presented in the PR body, not reflexive praise.

False positives

None detected. Every factual claim A made was verified against the diff and the live workflow file:

timeout-minutes: 30 is at line 27 — confirmed.
Ratio table values (1.67/min for write-docs after and propose-fix) are arithmetically correct.
No scope on the commit is the right call — ci files have no matching scope in the commitlint allowlist.
No .ai/ or CLAUDE.md update needed — correct for a CI config change with no user-facing or framework-consumer impact.

Missed issues

None detected. One minor observation A could have noted but did not need to flag: the PR body mentions the first run overshot the 30-turn cap by ~2 turns on its success path, so a 50-turn cap could in theory also overshoot by ~2. That does not affect correctness of the fix and does not warrant a change request.

Verdict alignment

APPROVED is consistent with a single-line, data-backed CI budget bump with no security, migration, deploy, or DI surface exposure.

wheels-bot

Wheels Bot — Reviewer A

TL;DR: This PR raises --max-turns across all six bot-stage workflows to give headroom against stochastic error_max_turns failures, with strong data-driven justification for the immediate trigger (write-docs, #2536). The direction is correct and the asymmetry argument is sound. One correctness concern warrants attention before merge: the timeout-minutes ceilings on several stages have not been adjusted proportionally, meaning that for bot-review-b and bot-review-a in particular, the wall-clock timeout now fires before the turn cap can if a run genuinely needs many turns. Verdict: comment (correctness concern — not blocking for low-probability cases, but worth resolving for internal consistency).

Correctness

Timeout / turn-cap misalignment (.github/workflows/bot-review-a.yml line 19, bot-review-b.yml line 18, bot-write-docs.yml line 27)

The PR raises --max-turns but does not touch timeout-minutes. This creates a logical inconsistency for three stages: if a run legitimately hits the new cap, the wall-clock timeout fires first — and unlike error_max_turns, a GitHub Actions wall-clock kill does not allow Claude to write its idempotency marker before exiting. That leaves the issue/PR without a marker, which can cause spurious re-triggers on the next qualifying event.

Stage	timeout-minutes	New max-turns	Time at 20 s/turn	Fits?
bot-triage	25	40	~13 min	Yes
bot-research	30	80	~27 min	Marginal (Opus turns can be slower)
bot-propose-fix	60	150	~50 min	Yes
bot-write-docs	30	100	~33 min	No — 3 min over
bot-review-a	20	60	~20 min	Borderline — exactly at limit
bot-review-b	15	60	~20 min	No — 5 min over

bot-review-b is the most at-risk: 60 turns at a conservative 20 s/turn exceeds the 15-minute budget by a third. bot-review-a sits right at its ceiling with zero slack. bot-write-docs at 100 turns and 30 minutes is fine in practice (observed runs use ~32 turns), but the logical maximum is incoherent.

Suggested fix — bump timeout-minutes to match:

# bot-review-a.yml
timeout-minutes: 35   # 60 turns x ~30 s/turn (generous) + startup overhead

# bot-review-b.yml
timeout-minutes: 35   # same reasoning; currently 15 is clearly too tight for new cap

# bot-write-docs.yml
timeout-minutes: 60   # 100 turns x ~30 s/turn; matches propose-fix precedent

Alternatively, document explicitly in each job comment that the wall-clock timeout is the intended hard cap for pathological runs and accept that marker state may be left inconsistent in those cases. The current PR body does not address this trade-off.

Evidence-base asymmetry (PR body, bot-research row)

The table in the PR body notes bot-research is "untested in production" yet still doubles its cap from 40 to 80. This is acceptable on the asymmetry argument, but a bot-research run using 70+ turns against a 30-minute timeout-minutes is plausible (Opus averages slower turn-around than Sonnet). Suggest bumping timeout-minutes on bot-research to at least 45 minutes if 80 turns is the new ceiling.

Commits

ci: audit bot stage turn caps for generous headroom — valid type (ci), no scope (correct; .github/workflows/ maps to no allowlist scope per CLAUDE.md § Commit Message Conventions), subject 49 chars, not ALL-CAPS. Clean.

Docs

No user-facing behaviour changed. No .ai/, CLAUDE.md, or CHANGELOG.md update is required for internal CI budget adjustments.

Tests / Security

Not applicable to CI workflow configuration files.

…ling Premature optimization on a dimension we don't have data for. The real ceiling is each workflow's `timeout-minutes` (5-60 min depending on stage), and the real safety net is `WHEELS_BOT_ENABLED=false`. Turn caps that compete with those ceilings just create a redundant cliff, and tonight we discovered painfully that misjudging where it sits costs a full failed run + manual cleanup + re-dispatch. Math that motivates the lift: with `timeout-minutes: 60` and Opus at ~5-10s per turn, propose-fix can practically execute ~360-600 turns before the timeout triggers anyway. A 100-turn cap was a cliff at ~17% of the time budget. A 1000-turn cap is unreachable — the timeout catches it first, which is what we want. Plan: run with these ceilings for a few weeks, collect data on actual turn counts across a variety of issue shapes, then revisit with empirical evidence rather than guesses. bot-triage: 20 → 200 (Sonnet, timeout 25m) bot-research: 40 → 400 (Opus, timeout 30m) bot-propose-fix: 100 → 1000 (Opus, timeout 60m — effectively no cap; workflow timeout becomes the ceiling) bot-write-docs: 30 → 300 (Sonnet, timeout 30m) bot-review-a: 25 → 250 (Sonnet, timeout 20m) bot-review-b: 30 → 300 (Sonnet, timeout 15m) Worst-case incremental cost if a stage hits its new cap before the workflow timeout: ~$50 for propose-fix Opus at 1000 turns, similar for research. In practice the workflow timeout will catch any runaway long before max-turns does. The kill switch (`WHEELS_BOT_ENABLED=false`) is the manual safety net for active incidents. Subsumes both the original 30 → 50 single-stage write-docs bump and the 2x audit. The 10x lift is the right level of generosity for the data-collection period before we tighten anything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wheels-bot · 2026-05-10T05:09:09Z

Wheels Bot — Reviewer B (round 2)

A's core finding (timeout/turn-cap misalignment for bot-review-b, bot-review-a, and bot-write-docs) is factually correct and concretely supported. The arithmetic and workflow file values are accurate. One missed issue: A applies the Opus-slowness caveat to bot-research but not to bot-propose-fix, which also runs on Opus and has the same structural risk at its new cap.

Sycophancy

None detected. A issued COMMENTED rather than APPROVED, backed by a specific, verifiable correctness concern with an arithmetic table. No reflexive praise.

False positives

None detected. Confirmed the actual timeout-minutes values via the workflow files:

Stage	timeout-minutes (actual)	New max-turns
bot-review-b	15	60
bot-review-a	20	60
bot-write-docs	30	100
bot-research	30	80
bot-propose-fix	60	150
bot-triage	25	40

A's table values are accurate. The marker-state corruption risk A describes (wall-clock kill leaving no idempotency marker) is real — GitHub Actions does not give the process a graceful shutdown window when timeout-minutes fires.

Missed issues

bot-propose-fix timeout also at risk under Opus turns — A explicitly notes "Opus averages slower turn-around than Sonnet" when flagging bot-research, then marks bot-propose-fix as "Yes / Fits" using the same 20 s/turn assumption it questioned for Opus. Bot-propose-fix also uses Opus (--model claude-opus-4-7). At 30 s/turn — the same Opus-adjusted rate A implies for research — 150 turns x 30 s = 75 min, which exceeds the 60-minute timeout by the same relative margin A found for bot-research. A should have flagged propose-fix consistently with its own caveat, or defended why propose-fix is exempt (e.g., observed runs stay well under 100 turns, so the pathological ceiling is less plausible).

Verdict alignment

COMMENTED is within the reasonable range: the misalignment is real but the failure probability is low (it requires a run that legitimately hits the new ceiling). CHANGES_REQUESTED would also have been defensible given the marker-state corruption consequence. The verdict is consistent with A's own framing of the concern as "not blocking for low-probability cases."

wheels-bot

Wheels Bot — Reviewer A

TL;DR: This PR replaces the intermediate 2× turn-cap bump (reviewed at sha 32d5e92) with a 10× bump that makes every --max-turns value effectively unreachable, shifting the active ceiling to timeout-minutes on all six stages. The design rationale is self-consistent and honest about the trade-offs. One pre-existing correctness concern — timeout-killed jobs cannot post idempotency markers — is now the only failure path for runaway jobs on the short-timeout stages, which increases its practical probability. That deserves explicit acknowledgement in the code rather than just the PR body. Verdict: comment.

Correctness

The turn-cap/timeout ratio is now coherent — this directly addresses the previous review's concern. At even an aggressive 20 s/turn:

Stage	timeout-minutes	New max-turns	Turns possible in window	Cap reachable?
bot-triage	25	200	~75	No
bot-research	30	400	~90	No
bot-propose-fix	60	1000	~180	No
bot-write-docs	30	300	~90	No
bot-review-a	20	250	~60	No
bot-review-b	15	300	~45	No

timeout-minutes is now unambiguously the active ceiling for every stage. The design is internally consistent.

Idempotency gap on timeout-killed jobs (.github/workflows/bot-review-b.yml line 18, .github/workflows/bot-review-a.yml line 19)

This concern was raised in the previous review and is unresolved by design choice. With turn caps effectively unreachable, a job that goes pathological will always exit via GitHub Actions wall-clock kill — not via Claude's own error_max_turns handler. The distinction matters: error_max_turns lets Claude finish its current tool call and post a final message; a wall-clock kill sends SIGTERM to the runner process and Claude cannot post the idempotency marker before exit.

The consequence: if bot-review-b (15 min timeout) gets a weird PR that drives it toward its limit, it exits without posting . The next qualifying event (another review push) re-triggers the stage. For bot-review-b this is low-stakes (duplicate review comment, not a code push), but for bot-write-docs and bot-propose-fix a re-trigger opens a duplicate branch.

The PR body states this is an accepted trade-off. That is a defensible position — but the trade-off currently lives only in this PR description, which will not be visible to the next person who diagnoses a double-trigger. Suggest adding a one-line comment above each affected job, e.g.:

# timeout-minutes is the active ceiling; max-turns is set high enough to be unreachable.
# A timeout-killed job exits without posting its idempotency marker — re-triggers are
# possible but expected to be rare and low-severity (duplicate comment, not duplicate push).
timeout-minutes: 15

This is a docs/convention finding, not a blocker.

Cost risk surface is unchanged in practice: the PR body is correct that a propose-fix Opus run consuming its full 60-minute budget was already possible before this change (the timeout was always there). The turn cap provided imperfect protection only against the narrow band of runs that would have hit the cap before the timeout. That band was small and the protection was brittle. Removing it while explicitly embracing the timeout as the ceiling is the cleaner design.

Commits

ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling — valid type (ci), no scope (correct; .github/workflows/ has no allowlist scope per CLAUDE.md § Commit Message Conventions), subject 59 chars, not ALL-CAPS. Communicates both the what and the why. Clean.

Docs / Tests / Security

Not applicable. No user-facing behaviour changed; no framework code touched; no credentials in scope.

wheels-bot · 2026-05-10T05:15:41Z

Wheels Bot — Reviewer B (round 3)

A's review is accurate and the verdict (COMMENTED) is appropriate. One minor internal inconsistency in the turn-cap reachability table; no false positives or sycophancy.

Sycophancy

None detected. A issued COMMENTED backed by a concrete finding (idempotency marker gap), not reflexive praise.

False positives

One minor inconsistency: A's table uses 20 s/turn as the aggressive estimate, giving bot-review-b 45 turns possible in its 15-minute window — comfortably below the new 300-turn cap. But the PR body's own math column for bot-review-b says "~180-300" practical turns in that same 15-minute window, implying ~3-5 s/turn for Sonnet. At the PR body's upper bound (3 s/turn), 300 turns fit in exactly 15 minutes — meaning the cap is borderline reachable, not firmly unreachable as A's table states.

This is not a consequential error: A's conservative estimate is the safer direction for a reachability analysis, and in practice bot-review-b runs stay well under 50 turns. But A's "Cap reachable? No" for bot-review-b is weakly supported by the PR body's own turn-rate estimates.

All other table values and factual claims are correct.

Missed issues

None detected. A caught the idempotency gap, the duplicate-branch risk for write-docs and propose-fix, and correctly validated the commit message. No CFML or framework code was touched; "not applicable" for docs/tests/security is correct.

Verdict alignment

COMMENTED is consistent with A's findings: the idempotency concern is real but pre-existing (not introduced by this PR), and the suggestion (add a YAML comment) is a docs nudge. No security, migration, deploy, or DI surface requires a human-only gate.

wheels-bot Bot previously approved these changes May 10, 2026

View reviewed changes

bpamiri dismissed wheels-bot[bot]’s stale review via 32d5e92 May 10, 2026 05:02

bpamiri force-pushed the peter/bot-write-docs-bump-turns branch from c4e1ff0 to 32d5e92 Compare May 10, 2026 05:02

bpamiri changed the title ~~ci: bump bot-write-docs max-turns 30 → 50~~ ci: audit bot stage turn caps for generous headroom May 10, 2026

wheels-bot Bot reviewed May 10, 2026

View reviewed changes

bpamiri force-pushed the peter/bot-write-docs-bump-turns branch from 32d5e92 to ef63d80 Compare May 10, 2026 05:08

bpamiri changed the title ~~ci: audit bot stage turn caps for generous headroom~~ ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling May 10, 2026

bpamiri merged commit 8ce5bc9 into develop May 10, 2026
5 checks passed

bpamiri deleted the peter/bot-write-docs-bump-turns branch May 10, 2026 05:09

wheels-bot Bot reviewed May 10, 2026

View reviewed changes

bpamiri mentioned this pull request May 10, 2026

ci: add convergence loop with address-review + senior-advisor stages #2540

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling#2538

ci: lift bot stage turn caps to ~10x — let timeout-minutes be the ceiling#2538
bpamiri merged 1 commit intodevelopfrom
peter/bot-write-docs-bump-turns

bpamiri commented May 10, 2026 •

edited

Loading

Uh oh!

wheels-bot Bot left a comment

Uh oh!

wheels-bot Bot commented May 10, 2026

Uh oh!

wheels-bot Bot left a comment

Uh oh!

wheels-bot Bot commented May 10, 2026

Uh oh!

Uh oh!

wheels-bot Bot left a comment

Uh oh!

wheels-bot Bot commented May 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bpamiri commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The math

What changed

The data-collection plan

What this does not change

Test plan

Uh oh!

wheels-bot Bot left a comment

Choose a reason for hiding this comment

Wheels Bot — Reviewer A

Correctness

Commits

Docs

Uh oh!

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 1)

Sycophancy

False positives

Missed issues

Verdict alignment

Uh oh!

wheels-bot Bot left a comment

Choose a reason for hiding this comment

Wheels Bot — Reviewer A

Correctness

Commits

Docs

Tests / Security

Uh oh!

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 2)

Sycophancy

False positives

Missed issues

Verdict alignment

Uh oh!

Uh oh!

wheels-bot Bot left a comment

Choose a reason for hiding this comment

Wheels Bot — Reviewer A

Correctness

Commits

Docs / Tests / Security

Uh oh!

wheels-bot Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Wheels Bot — Reviewer B (round 3)

Sycophancy

False positives

Missed issues

Verdict alignment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bpamiri commented May 10, 2026 •

edited

Loading

wheels-bot Bot commented May 10, 2026 •

edited

Loading