review/adversarial-review Step 3 system-instruction hardening — required H2 headings unreliable in Codex output #1

@kiki830621

Problem

The producer skills' Step 3 system instructions request specific H2 section headings, but Codex does not emit them reliably: its output is free-form LLM prose, and it sometimes omits or restructures the headings.

  • plugins/codex-pro/skills/review/SKILL.md Step 3 asks Codex to begin with ## Summary then ## Findings. Observed during the v0.4.x e2e cycle: Codex sometimes omits ## Summary (or both).
  • plugins/codex-pro/skills/adversarial-review/SKILL.md Step 3 asks for 4 mandatory H2 sections (## Assumptions Challenged / ## Failure Modes / ## Alternative Approaches / ## Trade-off Counterarguments). These hold up better (forceful "exactly four sections, each non-empty" wording), but the same fragility class applies.

Because of this non-determinism, tests/e2e.sh currently checks these headings with verify_substring_warn (best-effort warning, not a hard failure) instead of verify_substring. That means a real regression where Codex stops emitting the required structure would pass the Layer 3 e2e silently.
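
To make the difference between the two assertion styles concrete, here is a minimal sketch of how such helpers could behave. The function bodies, the `(file, needle)` argument order, and the `failures` / `warnings` counters are assumptions for illustration; the real helpers live in tests/e2e.sh and may differ.

```shell
# Hypothetical sketch only; not the actual tests/e2e.sh implementation.
failures=0
warnings=0

# Hard check: a missing substring is counted as a test failure.
verify_substring() {
  local file="$1" needle="$2"
  if grep -qF -- "$needle" "$file"; then
    echo "PASS: '$needle' found in $file"
  else
    echo "FAIL: '$needle' missing from $file"
    failures=$((failures + 1))
  fi
}

# Best-effort check: a missing substring is reported but never fails the run.
verify_substring_warn() {
  local file="$1" needle="$2"
  if grep -qF -- "$needle" "$file"; then
    echo "PASS: '$needle' found in $file"
  else
    echo "WARN: '$needle' missing from $file"
    warnings=$((warnings + 1))
  fi
}
```

Under this sketch, a run that ends with `failures` at 0 and `warnings` above 0 still exits green, which is exactly how a structural regression could slip through.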

Context: surfaced during the e2e-skill-invocation-tests cycle (commit e296d4a) and re-noted in the config-profile-mechanism cycle (v0.5.0, commit 196be77). The verify_substring_warn helper was introduced specifically as a stopgap for this — see its comment in tests/e2e.sh.

Type

refactor / enhancement (prompt-engineering hardening; no behavior change to the skill's contract, only to how reliably Codex satisfies it)

Expected

Codex reliably emits the required H2 headings for both producers, so the e2e checks can be promoted from verify_substring_warn back to hard verify_substring across all skill × scenario combinations.

Likely approach (to be confirmed at diagnose):

  • Strengthen review's Step 3 wording to match adversarial-review's forceful pattern ("Your output MUST begin with exactly these two headings, verbatim: ## Summary then ## Findings").
  • Add an explicit format scaffold / one-shot example block in the system instructions.
  • Re-run the full e2e matrix (12 combos) to confirm the headings appear reliably before promoting the assertions.
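
Once the wording is hardened, the matrix re-run could use a check that is stricter than a plain substring match. The sketch below is one possible shape, not part of the current test suite: `headings_in_order` is a hypothetical helper that succeeds only when each heading appears as a whole line and in the requested order.

```shell
# Hypothetical helper: returns 0 when every heading appears as a full line,
# in the order given; returns 1 otherwise.
headings_in_order() {
  local file="$1"
  shift
  local last=0 heading line
  for heading in "$@"; do
    # First line number after the previous match where the heading is the whole line.
    line="$(awk -v h="$heading" -v min="$last" 'NR > min && $0 == h { print NR; exit }' "$file")"
    [ -n "$line" ] || return 1
    last="$line"
  done
  return 0
}
```

For the review producer this might be called as `headings_in_order "$result_file" "## Summary" "## Findings"`, where `$result_file` stands for whatever path the e2e run writes Codex's output to. Note that it checks ordering, not that the output literally begins with the first heading.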

Actual

  • review Step 3 uses softer "Begin with a one-paragraph Summary … Follow with a Findings list" phrasing, which Codex interprets loosely.
  • e2e heading checks are verify_substring_warn (non-failing) for mixed / binary / oversize / empty-repo / with-profile scenarios; only all-empty (target_invalid pre-flight, written by Claude not Codex) and frontmatter markers are hard-verified.

Impact

  • Test integrity: Layer 3 e2e cannot currently catch a regression in the producers' output structure — the headings are advisory. This weakens the e2e gate that exists to catch SKILL→runtime drift.
  • Scope: plugins/codex-pro/skills/review/SKILL.md (Step 3), plugins/codex-pro/skills/adversarial-review/SKILL.md (Step 3, lighter touch), tests/e2e.sh (promote verify_substring_warn → verify_substring once reliable).
  • Cost: verification requires re-running the e2e matrix (real codex-call quota; ~12 combos). Bundle with the next producer change to amortize, or run standalone as a release-gate validation.
  • Priority: P2 — quality/test-integrity improvement, not a user-facing bug. The skill contract is unchanged; this hardens how reliably Codex honors it.

Non-Goals

  • Not changing the result-file frontmatter contract or the 4-section perspectival structure of adversarial-review.
  • Not changing codex-call flags or the profile mechanism (v0.5).

Current Status

Phase: diagnosed
Complexity: Spectra (route: spectra-discuss)
Updated: 2026-06-08

Diagnosis posted. Root cause: review's Step 3 instructions request "a Summary" / "a Findings list" as prose and never include the literal ## Summary / ## Findings H2 tokens that the contract and the e2e checks expect. Next: /spectra-discuss to align on open questions before the proposal.
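
The diagnosed root cause suggests a cheap static guard that needs no codex-call quota: confirm that the instruction file itself contains the literal heading tokens. The following is a rough sketch under that assumption; `missing_literal_headings` is a hypothetical name, and it scans the whole file rather than only Step 3.

```shell
# Hypothetical lint: prints each required heading that never appears
# literally in the given file. Empty output means all tokens are present.
missing_literal_headings() {
  local file="$1"
  shift
  local heading
  for heading in "$@"; do
    grep -qF -- "$heading" "$file" || printf '%s\n' "$heading"
  done
}
```

A possible invocation would be `missing_literal_headings plugins/codex-pro/skills/review/SKILL.md "## Summary" "## Findings"`, with any printed line treated as a lint failure.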

Metadata

Labels: enhancement