Problem
The producer skills' Step 3 system-instructions request specific H2 section headings, but Codex does not emit them reliably: its output is LLM-generated prose, and it sometimes omits or restructures the headings.
plugins/codex-pro/skills/review/SKILL.md Step 3 asks Codex to begin with ## Summary then ## Findings. Observed during the v0.4.x e2e cycle: Codex sometimes omits ## Summary (or both).
plugins/codex-pro/skills/adversarial-review/SKILL.md Step 3 asks for 4 mandatory H2 sections (## Assumptions Challenged / ## Failure Modes / ## Alternative Approaches / ## Trade-off Counterarguments). These hold up better (forceful "exactly four sections, each non-empty" wording), but the same fragility class applies.
Because of this non-determinism, tests/e2e.sh currently checks these headings with verify_substring_warn (best-effort warning, not a hard failure) instead of verify_substring. That means a real regression where Codex stops emitting the required structure would pass the Layer 3 e2e silently.
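To make the difference between the two assertion styles concrete, here is a minimal sketch of how such helpers might look. The function bodies, argument order (file, then needle), and messages are assumptions for illustration only; the real implementations live in tests/e2e.sh and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real helpers in tests/e2e.sh may differ.

# Hard check: returns non-zero when the needle is absent from the file,
# so a caller running under `set -e` (or checking $?) fails the test.
verify_substring() {
  local file="$1" needle="$2"
  if grep -qF -- "$needle" "$file"; then
    return 0
  fi
  echo "FAIL: '$needle' not found in $file" >&2
  return 1
}

# Soft check: reports the same condition on stderr but always returns 0,
# so a missing heading can never fail the run.
verify_substring_warn() {
  local file="$1" needle="$2"
  if ! grep -qF -- "$needle" "$file"; then
    echo "WARN: '$needle' not found in $file" >&2
  fi
  return 0
}
```

With helpers shaped like this, a result file that lacks `## Summary` makes `verify_substring` return 1 while `verify_substring_warn` still returns 0, which is why a structural regression can pass unnoticed.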
Context: surfaced during the e2e-skill-invocation-tests cycle (commit e296d4a) and re-noted in the config-profile-mechanism cycle (v0.5.0, commit 196be77). The verify_substring_warn helper was introduced specifically as a stopgap for this — see its comment in tests/e2e.sh.
Type
refactor / enhancement (prompt-engineering hardening; no behavior change to the skill's contract, only to how reliably Codex satisfies it)
Expected
Codex reliably emits the required H2 headings for both producers, so the e2e checks can be promoted from verify_substring_warn back to hard verify_substring across all skill × scenario combinations.
Likely approach (to be confirmed at diagnose):
- Strengthen review's Step 3 wording to match adversarial-review's forceful pattern ("Your output MUST begin with exactly these two headings, verbatim: ## Summary then ## Findings").
- Add an explicit format scaffold / one-shot example block in the system instructions.
- Re-run the full e2e matrix (12 combos) to confirm the headings appear reliably before promoting the assertions.
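Once the headings appear reliably, the promoted assertion could also check order, not just presence. The sketch below is one possible shape for that; `verify_headings_in_order` is a hypothetical helper that does not exist in tests/e2e.sh, and it assumes each heading sits alone on its line with no trailing whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of tests/e2e.sh today.
# Succeeds only if every heading appears as a whole line, in the given order.
verify_headings_in_order() {
  local file="$1"; shift
  local prev=0 heading line
  for heading in "$@"; do
    # Line number of the first exact-line match after the previous heading.
    line="$(awk -v h="$heading" -v p="$prev" \
      'NR > p && $0 == h { print NR; exit }' "$file")"
    if [ -z "$line" ]; then
      echo "FAIL: '$heading' missing or out of order in $file" >&2
      return 1
    fi
    prev="$line"
  done
}
```

For the review skill this would be called as `verify_headings_in_order "$result_file" "## Summary" "## Findings"`, where `$result_file` stands in for whatever path the e2e scenario writes. An exact-line match is stricter than a substring match, so it would only be worth adopting if the strengthened Step 3 wording proves stable across the matrix.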
Actual
- review Step 3 uses softer "Begin with a one-paragraph Summary … Follow with a Findings list" phrasing, which Codex interprets loosely.
- e2e heading checks use verify_substring_warn (non-failing) for the mixed / binary / oversize / empty-repo / with-profile scenarios; only all-empty (target_invalid pre-flight, written by Claude not Codex) and the frontmatter markers are hard-verified.
Impact
- Test integrity: Layer 3 e2e cannot currently catch a regression in the producers' output structure — the headings are advisory. This weakens the e2e gate that exists to catch SKILL→runtime drift.
- Scope: plugins/codex-pro/skills/review/SKILL.md (Step 3), plugins/codex-pro/skills/adversarial-review/SKILL.md (Step 3, lighter touch), tests/e2e.sh (promote verify_substring_warn → verify_substring once reliable).
- Cost: verification requires re-running the e2e matrix (real codex-call quota; ~12 combos). Bundle with the next producer change to amortize, or run standalone as a release-gate validation.
- Priority: P2 — quality/test-integrity improvement, not a user-facing bug. The skill contract is unchanged; this hardens how reliably Codex honors it.
Non-Goals
- Not changing the result-file frontmatter contract or the 4-section perspectival structure of adversarial-review.
- Not changing codex-call flags or the profile mechanism (v0.5).
Current Status
Phase: diagnosed
Complexity: Spectra (route: spectra-discuss)
Updated: 2026-06-08
Diagnosis posted (root cause: review Step 3 instructions request "a Summary" / "a Findings list" as prose, never the literal ## Summary / ## Findings H2 tokens the contract + e2e expect). Next: /spectra-discuss to align open questions before proposal.