ci: bump propose-fix max-turns 60 → 100 + temporary show_full_output (diagnostic) #2531
Three propose-fix attempts on the registry-list bug have hit the 60-turn ceiling identically:
| Run | Turns | Cost | Denials |
|--------------------|-------|---------|---------|
| #25616307210 | 61 | $3.66 | 15 |
| #25616508963 | 61 | $3.36 | 19 |
| #25617692327 | 61 | $3.63 | 18 |
The PR #2528 step-9 scope cut (4 doc locations → 1 CHANGELOG entry) did not move propose-fix off the cliff: the doc updates were not the dominant scope sink. The persistent 15-19 permission denials per run suggest Opus is wasting ~30% of the budget on tool calls outside the allowlist.
This is a diagnostic bump: lift the ceiling so a real-world propose-fix can complete, then observe how many turns it actually takes. The distance between the new ceiling and the actual run length tells us whether the structural problem is budget (gap is small, ceiling was just too tight) or prompt-vs-allowlist mismatch (gap is large, denials are the real waste).
Comparable to PR #2524's Reviewer B 20 → 30 bump pattern. If the follow-up evidence shows ~30% of turns going to denials, a future PR will tighten the prompt or expand the allowlist accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
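The "~30%" figure can be checked against the three runs listed above. The following is a minimal sketch, assuming each denial consumes exactly one turn (the commit message does not state this; it is a simplification for illustration):

```python
# Figures copied from the three runs above: (run id, turns, cost in USD, denials).
RUNS = [
    ("25616307210", 61, 3.66, 15),
    ("25616508963", 61, 3.36, 19),
    ("25617692327", 61, 3.63, 18),
]


def denial_share(turns: int, denials: int) -> float:
    """Fraction of turns lost to denials, ASSUMING one denial costs one turn."""
    return denials / turns


shares = [denial_share(turns, denials) for _, turns, _, denials in RUNS]
mean_share = sum(shares) / len(shares)

for (run_id, turns, _, denials), share in zip(RUNS, shares):
    print(f"#{run_id}: {denials}/{turns} turns denied = {share:.0%}")
print(f"mean share: {mean_share:.0%}")
```

Under that assumption the per-run share lands between roughly a quarter and a third of the turns, which is consistent with the rounded "~30%" estimate.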
Wheels Bot — Reviewer A
TL;DR: A single-line diagnostic CI bump, well-motivated and clean. One minor issue worth addressing before treating the resulting data as conclusive: the job-level timeout-minutes: 60 cap may fire before the new 100-turn ceiling is reached, meaning a timeout failure would be indistinguishable from a structural prompt-size failure without log inspection.
Correctness
Potential binding constraint: timeout-minutes: 60 may shadow --max-turns 100
File: .github/workflows/bot-propose-fix.yml, line 18
timeout-minutes: 60
The job times out after 60 minutes. An Opus 4.7 run with 100 turns on a code-writing task (test suite, multiple tool calls, denied-tool retries) can easily consume 45–70+ minutes. If the job is killed by the GitHub Actions wall clock before it hits 100 turns, the run logs will show Error: The operation was canceled rather than error_max_turns, which:
- makes it hard to distinguish "timed out before doing useful work" from "succeeded and stopped naturally"
- means the diagnostic signal the PR body promises (gap between actual turns used and the ceiling) won't be present
Suggested fix: Either raise timeout-minutes to 90 (or 120) for this diagnostic window, or add a note to the test plan acknowledging this risk and specifying how to distinguish the two failure modes in the run logs.
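One way to make the two failure modes distinguishable is to check the saved run log for the two markers named in this review. This is a rough sketch only: the function and enum names are hypothetical, and it assumes the two strings appear verbatim in the log text. A manually cancelled run may print the same cancellation message, so the result is a hint rather than proof.

```python
from enum import Enum


class RunEnd(Enum):
    TURN_CEILING = "stopped at --max-turns"
    WALL_CLOCK = "job cancelled, possibly by timeout-minutes"
    OTHER = "neither marker found"


def classify_run_end(log_text: str) -> RunEnd:
    """Classify how a run ended from its log text (hypothetical helper)."""
    # A turn-ceiling marker is the more specific signal, so check it first.
    if "error_max_turns" in log_text:
        return RunEnd.TURN_CEILING
    if "The operation was canceled" in log_text:
        return RunEnd.WALL_CLOCK
    return RunEnd.OTHER


print(classify_run_end("Error: The operation was canceled.").value)
```

A run classified as OTHER would be the case where the agent stopped on its own, which is the outcome the diagnostic is hoping to observe.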
Commits
Commit message ci: bump propose-fix max-turns 60 → 100 (diagnostic) conforms to commitlint: type ci, no scope (correct for .github/workflows/ changes per CLAUDE.md § "Common scope choices"), subject is sentence-case and ≤ 100 chars. Clean.
Docs
No user-facing behavior changed. No .ai/ or guide update required.
Security
No secrets, no user-input exposure, no new tool grants in the allowlist. Clean.
The change itself is correct and the PR body's measurement framework (small gap vs. medium gap vs. large gap) is solid. The timeout-minutes concern is the only thing that could make the diagnostic inconclusive rather than wrong.
The 60→100 turn bump unblocks the cliff but doesn't tell us why ~30% of every run goes to permission denials. The action's documentation explicitly points to show_full_output: true as the way to expose per-turn tool calls including which were rejected.
This adds the flag temporarily so the next propose-fix run logs every tool Opus tried plus its outcome. Once we know which tools are denied (likely candidates: Task, WebFetch, broader Bash commands), a follow-up PR removes this flag and either expands the allowlist or tightens the prompt to stop reaching for them.
Security note: show_full_output exposes tool-call detail to the public Actions log. The action redacts secrets (anthropic_api_key, github_token) at a separate layer regardless of this flag, so credentials don't leak. Tool names and file paths Opus reads will be visible; that's the diagnostic content we want.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
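Once the full output is available, the denied calls can be tallied per tool to decide between expanding the allowlist and tightening the prompt. The sketch below assumes the tool calls have already been extracted by hand into (tool name, was denied) pairs; the record shape and the sample data are hypothetical, since the actual log format is not described here.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_denials(calls: Iterable[Tuple[str, bool]]) -> Counter:
    """Count denied calls per tool from (tool_name, was_denied) pairs."""
    return Counter(tool for tool, denied in calls if denied)


# Hypothetical sample, NOT taken from a real run.
sample = [
    ("Read", False),
    ("Task", True),
    ("WebFetch", True),
    ("Task", True),
    ("Bash", False),
]

for tool, count in tally_denials(sample).most_common():
    print(f"{tool}: {count} denied")
```

A tally dominated by one or two tools would point toward a small, targeted change; a long tail of different tools would suggest the prompt itself is steering the model outside the allowlist.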
Wheels Bot — Reviewer B (round 1)
A's review is largely correct: the core finding about […]
Sycophancy
None detected. A said the change "is correct" and the PR body's measurement framework is "solid," but both statements are factually defensible for a single-line CI bump. A withheld approval, which is the right call given the open timeout concern.
False positives
Line 18 citation is wrong. A wrote:
Line 18 of the file is […]
Missed issues
None detected. The diff is a single-line change to a workflow file. PR targets […]
Verdict alignment
COMMENTED rather than APPROVED is the right call: there is a concrete, open concern (the job wall clock could cut off the diagnostic before 100 turns) that the author should acknowledge in the test plan even if it is not a blocker for merging.
Wheels Bot — Reviewer A
TL;DR: The new commit (a4c5d3b83) adds show_full_output: true to the propose-fix workflow — a well-motivated diagnostic lever given the 15–19 denied-tool observations per run. One blocker: the commit uses an invalid scope (diagnostic) that commitlint will reject at error level. The timeout-minutes: 60 job cap flagged in the prior review (SHA 63a43016a) is also still unaddressed.
Commits
Invalid scope diagnostic in commit a4c5d3b83
ci(diagnostic): show_full_output on propose-fix to expose denied tools
The scope-enum rule in commitlint.config.js is set to [2, 'always', [...]] — error level, always enforced. The scope diagnostic is not in the allowlist:
// commitlint.config.js: full scope allowlist
'scope-enum': [2, 'always', [
  'model', 'controller', 'view', 'router', 'middleware', 'migration',
  'cli', 'test', 'config', 'di', 'job', 'mailer', 'plugin', 'sse',
  'seed', 'docs', 'web', 'web/ui', 'web/landing', 'web/blog',
  'web/guides', 'web/api', 'web/starlight', 'deps', 'deps-dev'
]]
Per CLAUDE.md § "Commit Message Conventions":
When a change spans multiple layers or doesn't map cleanly, use no scope —
ci: … is a valid bare form. Never invent a scope — commitlint will reject the commit.
The Validate Commit Messages CI check will fail. Fix by amending to drop the scope:
ci: show_full_output on propose-fix to expose denied tools
Then git push --force-with-lease origin peter/bot-propose-fix-turns-100.
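The rule can be approximated locally before pushing. The sketch below is not commitlint itself: it is a simplified Python stand-in with a hypothetical helper name and a hand-written header pattern, and it checks only the scope against the allowlist quoted above. Running commitlint directly remains the authoritative check.

```python
import re

# Scopes copied from the commitlint.config.js allowlist quoted above.
ALLOWED_SCOPES = {
    "model", "controller", "view", "router", "middleware", "migration",
    "cli", "test", "config", "di", "job", "mailer", "plugin", "sse",
    "seed", "docs", "web", "web/ui", "web/landing", "web/blog",
    "web/guides", "web/api", "web/starlight", "deps", "deps-dev",
}

# Simplified "type(scope): subject" pattern; the scope part is optional.
HEADER = re.compile(r"^(?P<type>\w+)(?:\((?P<scope>[^)]+)\))?: (?P<subject>.+)$")


def scope_is_valid(header: str) -> bool:
    """Return True if the header has no scope or an allowlisted scope."""
    match = HEADER.match(header)
    if match is None:
        return False
    scope = match.group("scope")
    return scope is None or scope in ALLOWED_SCOPES


for header in (
    "ci(diagnostic): show_full_output on propose-fix to expose denied tools",
    "ci: show_full_output on propose-fix to expose denied tools",
):
    print(f"{scope_is_valid(header)!s:5}  {header}")
```

The first header is rejected because diagnostic is not in the set; the bare ci: form passes because an absent scope is allowed.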
Correctness
Unaddressed: timeout-minutes: 60 job cap
.github/workflows/bot-propose-fix.yml, line 18:
timeout-minutes: 60
The prior Reviewer A comment (on 63a43016a) flagged this: a 100-turn Opus run on a code-writing task can consume 45–70+ minutes. If the GitHub Actions wall clock fires first, the log shows Error: The operation was canceled, not error_max_turns. The PR body's outcome decision tree depends on reading turns-used and denial counts from the log; that data is absent when the job is killed by the wall clock rather than reaching the turn ceiling.
Adding show_full_output: true increases per-turn log overhead slightly, which marginally worsens this exposure.
No follow-up has addressed this. Suggested fix: raise timeout-minutes to 90 or 120 for the diagnostic window, or add a test-plan note explaining how to distinguish a wall-clock kill from a natural run end in the resulting log.
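The size of the exposure can be estimated from the 45–70 minute range given for a 100-turn run. This is a back-of-the-envelope sketch that assumes every turn takes the same time and that the agent accounts for the whole job duration; both are simplifications, and the function name is hypothetical.

```python
def turns_before_timeout(timeout_minutes: int, minutes_per_100_turns: int) -> int:
    """Turns completed before the wall clock fires, ASSUMING a constant pace."""
    return (timeout_minutes * 100) // minutes_per_100_turns


for timeout in (60, 90, 120):
    fast = turns_before_timeout(timeout, 45)  # optimistic end of the range
    slow = turns_before_timeout(timeout, 70)  # pessimistic end of the range
    binds = "yes" if slow < 100 else "no"
    print(f"timeout={timeout} min: {slow} to {fast} turns; can cap bind before turn 100? {binds}")
```

Under these assumptions a 60-minute cap can cut a slow run off in the mid-80s of turns, while 90 minutes leaves room for all 100 turns even at the slow end of the quoted range.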
Security
show_full_output: true is a deliberate, documented choice. The PR body inventories what is exposed (tool names, arguments, file contents read by Read) and notes that anthropic_api_key and github_token are redacted at the action layer regardless of this flag. The workflow runs against a public-repo checkout with no .env or mounted secrets. The one-run trade-off is proportionate to the diagnostic need. No objection.
Two items require action before merge: the commitlint violation (blocker — CI will reject the commit) and the unaddressed timeout-minutes concern (the diagnostic will produce ambiguous data if the wall clock fires at 60 min on a 100-turn run).
Wheels Bot — Reviewer B (round 2)
A's review correctly identifies the CI-blocking commitlint violation and the unaddressed timeout concern. One persistent citation error; no sycophancy; no missed issues.
Sycophancy
None detected. A described the change as "well-motivated", which is defensible given the three identical failure runs documented in the PR body. A withheld approval and requested changes, which is the right call when commitlint is failing.
False positives
Line 18 citation is wrong, the same error as round 1. A wrote:
Line 18 is […]
Missed issues
None detected. The diff is a single functional line ([…]
Verdict alignment
CHANGES_REQUESTED is correct. The […]
Summary
Two diagnostic levers in one PR:
- max-turns: 60 → 100 (permanent, sized for actual work): the cliff three identical runs hit was 61 turns.
- show_full_output: true (temporary, remove after the next run yields data): so we can finally see which tools Opus is reaching for that the allowlist rejects.
Three identical error_max_turns failures at 61 turns on issue #2530 and earlier on #2526:
PR #2528 cut step 9 (doc updates) from 4 locations to 1 CHANGELOG entry. The third run, post-#2528 merge, hit the same wall in the same number of turns at nearly the same cost, disproving the "doc updates were the budget sink" hypothesis. The persistent 15-19 permission denials per run mean Opus is wasting roughly 30% of every run on rejected tool attempts, and we don't currently have visibility into which tools.
Why both levers in one PR
Bumping the budget without expanding visibility means the next 100-turn run will produce more unobservable denials and we'll learn nothing actionable. Adding show_full_output: true without bumping the budget means the run still fails at 61 turns and we don't see how the chain would play out with headroom. Both knobs are needed for one informative experiment.
The action's existing log line tells us the visibility flag is the right approach:
Outcome decision tree (after this lands and one run completes)
The follow-up PR depends on what we observe:
[…] show_full_output, done
[…] Task, WebFetch) or add explicit "do not use these" to the prompt; remove show_full_output
Security note on show_full_output
The action redacts anthropic_api_key and github_token at a separate layer regardless of this flag, so credentials don't leak into logs. What show_full_output exposes:
[…] Task, WebFetch, Read)
For a public repo, every public Actions log is world-readable. We're trading a one-time exposure of "what Opus tried to do on this specific issue" for the ability to fix the budget mismatch. The flag comes back out in a follow-up PR.
Cost expectation
A successful 100-turn Opus run with full output enabled is roughly $5–8 (full output adds some token weight to the trace logs). A failure at 100 would be similar.
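The estimate can be sanity-checked by extrapolating from the three 61-turn runs reported in this PR ($3.66, $3.36, $3.63). The sketch below assumes cost scales linearly with turns and ignores the extra weight of full output, so it is a lower-bound style illustration rather than a forecast; the function name is hypothetical.

```python
# (turns, cost in USD) for the three failed runs reported in this PR.
OBSERVED = [(61, 3.66), (61, 3.36), (61, 3.63)]


def projected_cost(turns: int, observed=OBSERVED) -> tuple:
    """Return (low, high) projected cost, ASSUMING cost is linear in turns."""
    per_turn = [cost / used for used, cost in observed]
    return min(per_turn) * turns, max(per_turn) * turns


low, high = projected_cost(100)
print(f"projected 100-turn cost: ${low:.2f} to ${high:.2f}")
```

The linear projection falls at the lower end of the $5–8 range, which leaves headroom for the additional trace output and for later turns carrying more context than earlier ones.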
Test plan
gh workflow run bot-propose-fix.yml --repo wheels-dev/wheels -F issue-number=2530.
[…] show_full_output: true and either expanding the allowlist or tightening the prompt based on what we learned.
Mirror precedent
Budget-bump style mirrors #2524 ("ci: bump Reviewer B max-turns 20 → 30"). The visibility flag is the diagnostic equivalent of set -x in a flaky shell script.
🤖 Generated with Claude Code