
ci: bump propose-fix max-turns 60 → 100 + temporary show_full_output (diagnostic) #2531

Merged
bpamiri merged 2 commits into develop from peter/bot-propose-fix-turns-100
May 10, 2026

Conversation

Collaborator

@bpamiri bpamiri commented May 10, 2026

Summary

Two diagnostic levers in one PR:

  1. max-turns: 60 → 100 (permanent — sized for actual work) — the cliff three identical runs hit was 61 turns.
  2. show_full_output: true (temporary — remove after next run yields data) — so we can finally see which tools Opus is reaching for that the allowlist rejects.

Three identical error_max_turns failures at 61 turns on issue #2530 and earlier on #2526:

| Run          | Turns | Cost  | Denials |
|--------------|-------|-------|---------|
| #25616307210 | 61    | $3.66 | 15      |
| #25616508963 | 61    | $3.36 | 19      |
| #25617692327 | 61    | $3.63 | 18      |

PR #2528 cut step 9 (doc updates) from 4 locations to 1 CHANGELOG entry. The third run, post-#2528 merge, hit the same wall in the same number of turns at the same cost — disproving the "doc updates were the budget sink" hypothesis. The persistent 15-19 permission denials per run mean Opus is wasting roughly 30% of every run on rejected tool attempts, and we don't currently have visibility into which tools.
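
A quick check of the "roughly 30%" figure, using the three runs listed above. This is a sketch that assumes each denial consumes exactly one turn, which the run summaries do not confirm; the names `runs` and `denial_share` are illustrative only.

```python
# Turns and denials from the three failed runs reported in this PR.
runs = {
    "25616307210": {"turns": 61, "denials": 15},
    "25616508963": {"turns": 61, "denials": 19},
    "25617692327": {"turns": 61, "denials": 18},
}


def denial_share(turns: int, denials: int) -> float:
    """Fraction of turns lost to denials, ASSUMING one denial costs one turn."""
    return denials / turns


shares = {run_id: denial_share(**run) for run_id, run in runs.items()}
for run_id, share in shares.items():
    print(f"#{run_id}: {share:.1%}")

mean_share = sum(shares.values()) / len(shares)
print(f"mean: {mean_share:.1%}")
```

Under that assumption the share ranges from about 25% to 31% and averages about 28%, so "roughly 30%" is a fair rounding. If a denial triggers retries that cost more than one turn, the real waste would be higher.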

Why both levers in one PR

Bumping the budget without expanding visibility means the next 100-turn run will produce more unobservable denials and we'll learn nothing actionable. Adding show_full_output: true without bumping the budget means the run still fails at 61 turns and we don't see how the chain would play out with headroom. Both knobs are needed for one informative experiment.

The action's existing log line tells us the visibility flag is the right approach:

"Running Claude Code via SDK (full output hidden for security)..."
"Rerun in debug mode or enable show_full_output: true in your workflow file for full output."

Outcome decision tree (after this lands and one run completes)

The follow-up PR depends on what we observe:

| Observed | Diagnosis | Follow-up |
|----------|-----------|-----------|
| Succeeds, 60–80 turns, denials < 5 | 60-turn cap was just too tight; no real waste | Drop ceiling to 80, remove show_full_output, done |
| Succeeds, 80–95 turns, denials 15–20 | Denials are 30% waste; need allowlist or prompt tightening | Identify top 3 denied tool patterns; expand allowlist (e.g. add Task, WebFetch) or add explicit "do not use these" to the prompt; remove show_full_output |
| Succeeds, 95–100 turns | Same as above, but barely fits at 100 — risky long-term | Same fix as above + keep ceiling at 100 |
| Fails at 100 with similar denial rate | Architecture mismatch — prompt fundamentally too big for one stage | Surgical refactor: split propose-fix into "context-gathering" + "implementation" stages, separate budgets |
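
The decision tree above can be sketched as a small function, which also makes its gaps visible. This is an illustration and not part of the workflow: the function name, the return labels, and the handling of boundary values (exactly 80 or 95 turns) are assumptions, and any combination the table does not list is reported as unlisted and not guessed at.

```python
DROP_CEILING = "drop ceiling to 80; remove show_full_output"
FIX_DENIALS = "expand allowlist or tighten prompt; remove show_full_output"
FIX_DENIALS_KEEP_100 = "expand allowlist or tighten prompt; keep ceiling at 100"
SPLIT_STAGES = "split propose-fix into context-gathering + implementation stages"
UNLISTED = "combination not listed in the table"


def classify_outcome(succeeded: bool, turns: int, denials: int,
                     ceiling: int = 100) -> str:
    """Map one run's result to a follow-up from the outcome table.

    ASSUMPTIONS: 80 turns falls in the 80-95 row, 95 turns in the 95-100 row,
    and a failed run is only classified when it reached the ceiling (the
    "similar denial rate" condition is not checked).
    """
    if not succeeded:
        return SPLIT_STAGES if turns >= ceiling else UNLISTED
    if 95 <= turns <= ceiling:
        return FIX_DENIALS_KEEP_100
    if 80 <= turns < 95 and 15 <= denials <= 20:
        return FIX_DENIALS
    if 60 <= turns < 80 and denials < 5:
        return DROP_CEILING
    return UNLISTED


print(classify_outcome(True, 88, 17))
print(classify_outcome(True, 72, 10))  # mid-range denials: the table has no row
```

The second call shows a case the table leaves open: a success in 60–80 turns with 5–14 denials has no assigned follow-up.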

Security note on show_full_output

The action redacts anthropic_api_key and github_token at a separate layer regardless of this flag, so credentials don't leak into logs. What show_full_output exposes:

  • Tool names called (e.g. Task, WebFetch, Read)
  • Arguments to those calls (file paths, search terms, etc.)
  • Tool results (file contents Opus read, command outputs)

For a public repo, every public Actions log is world-readable. We're trading a one-time exposure of "what Opus tried to do on this specific issue" for the ability to fix the budget mismatch. The flag comes back out in a follow-up PR.

Cost expectation

A successful 100-turn Opus run with full output enabled is roughly $5–8 (full output adds some token weight to the trace logs). A failure at 100 would be similar.
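
As a rough sanity check on that range, the three failed runs can be scaled linearly to 100 turns. Linear scaling is an assumption of this sketch: later turns carry more context and may cost more per turn, and the extra cost of full output is not modelled.

```python
# (turns, cost in USD) for the three failed runs reported in this PR.
observed = [(61, 3.66), (61, 3.36), (61, 3.63)]


def projected_cost(turns: int, cost: float, target_turns: int = 100) -> float:
    """Scale a run's cost linearly to target_turns (ASSUMED linear)."""
    return cost / turns * target_turns


projections = [round(projected_cost(t, c), 2) for t, c in observed]
print(projections)  # [6.0, 5.51, 5.95]
```

The projections land at about $5.50 to $6.00, the lower end of the $5–8 estimate, which leaves headroom for per-turn cost growing with context.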

Test plan

  • Merge this PR.
  • Manually re-dispatch propose-fix on issue #2530 (Debug Panel "Packages" tab does not list registry packages (pipeline test)): gh workflow run bot-propose-fix.yml --repo wheels-dev/wheels -F issue-number=2530.
  • Capture from the run log: turns used, cost, denied-tool list, denied-tool counts.
  • If a draft PR opens: verify the rest of the cascade (bot-update-docs, TDD gate, Reviewer A, Reviewer B).
  • Open follow-up PR removing show_full_output: true and either expanding the allowlist or tightening the prompt based on what we learned.
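
For the capture step, a minimal tally of denied tools might look like the sketch below. The format of the show_full_output log is not known here, so the input is a hypothetical list of already-extracted records with `tool` and `denied` keys; the sample values are invented for illustration.

```python
from collections import Counter


def top_denied(calls: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequently denied tool names with their counts."""
    denied = Counter(call["tool"] for call in calls if call["denied"])
    return denied.most_common(n)


# HYPOTHETICAL sample; real entries would be transcribed from the run log.
sample = [
    {"tool": "Task", "denied": True},
    {"tool": "Read", "denied": False},
    {"tool": "WebFetch", "denied": True},
    {"tool": "Task", "denied": True},
]
print(top_denied(sample))  # [('Task', 2), ('WebFetch', 1)]
```

The top three entries of that tally are what the follow-up PR would need to decide between expanding the allowlist and tightening the prompt.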

Mirror precedent

Budget-bump style mirrors #2524 ("ci: bump Reviewer B max-turns 20 → 30"). The visibility flag is the diagnostic equivalent of set -x in a flaky shell script.

🤖 Generated with Claude Code

Three propose-fix attempts on the registry-list bug have hit the
60-turn ceiling identically:

| Run                | Turns | Cost    | Denials |
|--------------------|-------|---------|---------|
| #25616307210       | 61    | $3.66   | 15      |
| #25616508963       | 61    | $3.36   | 19      |
| #25617692327       | 61    | $3.63   | 18      |

The PR #2528 step-9 scope cut (4 doc locations → 1 CHANGELOG entry)
did not move propose-fix off the cliff — the doc updates were not
the dominant scope sink. The persistent 15-19 permission denials per
run suggest Opus is wasting ~30% of the budget on tool calls outside
the allowlist.

This is a diagnostic bump: lift the ceiling so a real-world propose-fix
can complete, then observe how many turns it actually takes. The
distance between the new ceiling and the actual run length tells us
whether the structural problem is budget (gap is small, ceiling was
just too tight) or prompt-vs-allowlist mismatch (gap is large, denials
are the real waste).

Comparable to PR #2524's Reviewer B 20 → 30 bump pattern. If the
follow-up evidence shows ~30% of turns going to denials, a future PR
will tighten the prompt or expand the allowlist accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@wheels-bot wheels-bot Bot left a comment


Wheels Bot — Reviewer A

TL;DR: A single-line diagnostic CI bump, well-motivated and clean. One minor issue worth addressing before treating the resulting data as conclusive: the job-level timeout-minutes: 60 cap may fire before the new 100-turn ceiling is reached, meaning a timeout failure would be indistinguishable from a structural prompt-size failure without log inspection.

Correctness

Potential binding constraint: timeout-minutes: 60 may shadow --max-turns 100

File: .github/workflows/bot-propose-fix.yml, line 18

timeout-minutes: 60

The job times out after 60 minutes. An Opus 4.7 run with 100 turns on a code-writing task (test suite, multiple tool calls, denied-tool retries) can easily consume 45–70+ minutes. If the job is killed by the GitHub Actions wall clock before it hits 100 turns, the run logs will show Error: The operation was canceled rather than error_max_turns, which:

  • makes it hard to distinguish "timed out before doing useful work" from "succeeded and stopped naturally"
  • means the diagnostic signal the PR body promises (gap between actual turns used and the ceiling) won't be present

Suggested fix: Either raise timeout-minutes to 90 (or 120) for this diagnostic window, or add a note to the test plan acknowledging this risk and specifying how to distinguish the two failure modes in the run logs.
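
If the test-plan note is the route taken, the two failure modes can be told apart by the marker strings named above. A minimal sketch, assuming the run log is available as plain text; the function name and return labels are hypothetical, and the cancel message may also appear when a run is cancelled manually.

```python
def failure_mode(log_text: str) -> str:
    """Classify a run log by the two markers named in this review.

    ASSUMPTION: error_max_turns takes priority if both markers appear.
    """
    if "error_max_turns" in log_text:
        return "turn ceiling reached"
    if "The operation was canceled" in log_text:
        return "job cancelled (wall-clock timeout or manual cancel)"
    return "neither marker found"


print(failure_mode("Error: The operation was canceled."))
```

A run classified as cancelled carries no usable turns-used figure, so it should be re-run with a higher timeout-minutes and left out of the outcome table.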

Commits

Commit message ci: bump propose-fix max-turns 60 → 100 (diagnostic) conforms to commitlint: type ci, no scope (correct for .github/workflows/ changes per CLAUDE.md § "Common scope choices"), subject is sentence-case and ≤ 100 chars. Clean.

Docs

No user-facing behavior changed. No .ai/ or guide update required.

Security

No secrets, no user-input exposure, no new tool grants in the allowlist. Clean.


The change itself is correct and the PR body's measurement framework (small gap vs. medium gap vs. large gap) is solid. The timeout-minutes concern is the only thing that could make the diagnostic inconclusive rather than wrong.

The 60→100 turn bump unblocks the cliff but doesn't tell us why ~30%
of every run goes to permission denials. The action's documentation
explicitly points to show_full_output: true as the way to expose
per-turn tool calls including which were rejected.

Adding the flag temporarily so the next propose-fix run logs every
tool Opus tried plus its outcome. Once we know which tools are
denied (likely candidates: Task, WebFetch, broader Bash commands),
a follow-up PR removes this flag and either expands the allowlist
or tightens the prompt to stop reaching for them.

Security note: show_full_output exposes tool-call detail to the
public Actions log. The action redacts secrets (anthropic_api_key,
github_token) at a separate layer regardless of this flag, so
credentials don't leak. Tool names and file paths Opus reads will
be visible — that's the diagnostic content we want.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bpamiri bpamiri changed the title from "ci: bump propose-fix max-turns 60 → 100 (diagnostic)" to "ci: bump propose-fix max-turns 60 → 100 + temporary show_full_output (diagnostic)" May 10, 2026
@bpamiri bpamiri merged commit b1bda08 into develop May 10, 2026
4 of 5 checks passed
@bpamiri bpamiri deleted the peter/bot-propose-fix-turns-100 branch May 10, 2026 02:48
Contributor

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 1)

A's review is largely correct — the core finding about timeout-minutes: 60 being a potential binding constraint is substantively sound, the verdict of COMMENTED (not APPROVED) is appropriate, and the commit / docs / security assessments are accurate. One line-number citation error, no missed issues.

Sycophancy

None detected. A said the change "is correct" and the PR body's measurement framework is "solid," but both statements are factually defensible for a single-line CI bump. A withheld approval, which is the right call given the open timeout concern.

False positives

Line 18 citation is wrong. A wrote:

File: .github/workflows/bot-propose-fix.yml, line 18

Line 18 of the file is cancel-in-progress: false. The timeout-minutes: 60 setting is at line 24. The underlying claim is correct — the setting exists and the concern is valid — but the wrong line number would send a reviewer to the wrong place when reading the file.

Missed issues

None detected. The diff is a single-line change to a workflow file. PR targets develop (correct), commit message conforms to commitlint, no new tool grants, no .ai/ update needed.

Verdict alignment

COMMENTED rather than APPROVED is the right call: there is a concrete, open concern (the job wall clock could cut off the diagnostic before 100 turns) that the author should acknowledge in the test plan even if it is not a blocker for merging.

Contributor

@wheels-bot wheels-bot Bot left a comment


Wheels Bot — Reviewer A

TL;DR: The new commit (a4c5d3b83) adds show_full_output: true to the propose-fix workflow — a well-motivated diagnostic lever given the 15–19 denied-tool observations per run. One blocker: the commit uses an invalid scope (diagnostic) that commitlint will reject at error level. The timeout-minutes: 60 job cap flagged in the prior review (SHA 63a43016a) is also still unaddressed.

Commits

Invalid scope diagnostic in commit a4c5d3b83

ci(diagnostic): show_full_output on propose-fix to expose denied tools

The scope-enum rule in commitlint.config.js is set to [2, 'always', [...]] — error level, always enforced. The scope diagnostic is not in the allowlist:

// commitlint.config.js — full scope allowlist
'scope-enum': [2, 'always', [
  'model', 'controller', 'view', 'router', 'middleware', 'migration',
  'cli', 'test', 'config', 'di', 'job', 'mailer', 'plugin', 'sse',
  'seed', 'docs', 'web', 'web/ui', 'web/landing', 'web/blog',
  'web/guides', 'web/api', 'web/starlight', 'deps', 'deps-dev'
]]

Per CLAUDE.md § "Commit Message Conventions":

When a change spans multiple layers or doesn't map cleanly, use no scope; ci: … is a valid bare form. Never invent a scope — commitlint will reject the commit.

The Validate Commit Messages CI check will fail. Fix by amending to drop the scope:

ci: show_full_output on propose-fix to expose denied tools

Then git push --force-with-lease origin peter/bot-propose-fix-turns-100.
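
To check a header locally before amending, the scope rule can be approximated as below. This sketch only mimics the scope-enum check with a simplified regular expression; it is not commitlint's own parser, and `scope_ok` is a hypothetical helper. The allowlist is copied from the commitlint.config.js excerpt above.

```python
import re

# Scope allowlist copied from the commitlint.config.js excerpt in this review.
ALLOWED_SCOPES = {
    "model", "controller", "view", "router", "middleware", "migration",
    "cli", "test", "config", "di", "job", "mailer", "plugin", "sse",
    "seed", "docs", "web", "web/ui", "web/landing", "web/blog",
    "web/guides", "web/api", "web/starlight", "deps", "deps-dev",
}

# Simplified conventional-commit header: type, optional (scope), ": ", subject.
HEADER = re.compile(r"^(?P<type>\w+)(?:\((?P<scope>[^)]*)\))?!?: (?P<subject>.+)$")


def scope_ok(header: str) -> bool:
    """True if the header has no scope or a scope in ALLOWED_SCOPES."""
    match = HEADER.match(header)
    if match is None:
        return False  # not a conventional-commit header at all
    scope = match.group("scope")
    return scope is None or scope in ALLOWED_SCOPES


print(scope_ok("ci(diagnostic): show_full_output on propose-fix to expose denied tools"))  # False
print(scope_ok("ci: show_full_output on propose-fix to expose denied tools"))  # True
```

The bare `ci:` form passes because a missing scope is accepted, matching the CLAUDE.md guidance quoted above.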

Correctness

Unaddressed: timeout-minutes: 60 job cap

.github/workflows/bot-propose-fix.yml, line 18:

timeout-minutes: 60

The prior Reviewer A comment (on 63a43016a) flagged this: a 100-turn Opus run on a code-writing task can consume 45–70+ minutes. If the GitHub Actions wall clock fires first, the log shows Error: The operation was canceled, not error_max_turns. The PR body's outcome decision tree depends on reading turns-used and denial counts from the log — that data is absent when the job is killed by the wall clock rather than reaching the turn ceiling.

Adding show_full_output: true increases per-turn log overhead slightly, which marginally worsens this exposure.

No follow-up has addressed this. Suggested fix: raise timeout-minutes to 90 or 120 for the diagnostic window, or add a test-plan note explaining how to distinguish a wall-clock kill from a natural run end in the resulting log.

Security

show_full_output: true is a deliberate, documented choice. The PR body inventories what is exposed (tool names, arguments, file contents read by Read) and notes that anthropic_api_key and github_token are redacted at the action layer regardless of this flag. The workflow runs against a public-repo checkout with no .env or mounted secrets. The one-run trade-off is proportionate to the diagnostic need. No objection.


Two items require action before merge: the commitlint violation (blocker — CI will reject the commit) and the unaddressed timeout-minutes concern (the diagnostic will produce ambiguous data if the wall clock fires at 60 min on a 100-turn run).

Contributor

wheels-bot Bot commented May 10, 2026

Wheels Bot — Reviewer B (round 2)

A's review correctly identifies the CI-blocking commitlint violation and the unaddressed timeout concern. One persistent citation error; no sycophancy; no missed issues.

Sycophancy

None detected. A described the change as "well-motivated" — defensible given the three identical failure runs documented in the PR body. A withheld approval and requested changes, which is the right call when commitlint is failing.

False positives

Line 18 citation is wrong — same error as round 1. A wrote:

.github/workflows/bot-propose-fix.yml, line 18

Line 18 is cancel-in-progress: false. The timeout-minutes: 60 setting is at line 24. B (round 1) flagged this identical citation error on the prior commit (63a43016a). A repeated the same wrong line number here. The underlying concern is accurate; the citation is not.

Missed issues

None detected. The diff is a single functional line (show_full_output: true) plus four comment lines. A covered the two substantive concerns (commitlint scope, wall-clock timeout). CI confirms the Validate Commit Messages check is failing, validating A's primary finding.

Verdict alignment

CHANGES_REQUESTED is correct. The diagnostic scope is not in commitlint.config.js's scope-enum allowlist, and CI has confirmed the failure. A's framing of the timeout concern as also "requiring action before merge" is somewhat overstated — it is a diagnostic risk worth noting, not a CI-enforced gate — but it does not undermine the verdict.
