review/adversarial-review Step 3 system-instruction hardening — required H2 headings unreliable in Codex output

## Problem

The producer skills' Step 3 system-instructions request specific H2 section headings, but Codex does not emit them reliably — it's LLM prose and sometimes omits or restructures the headings.

- `plugins/codex-pro/skills/review/SKILL.md` Step 3 asks Codex to begin with `## Summary` then `## Findings`. Observed during the v0.4.x e2e cycle: Codex sometimes omits `## Summary` (or both).
- `plugins/codex-pro/skills/adversarial-review/SKILL.md` Step 3 asks for 4 mandatory H2 sections (`## Assumptions Challenged` / `## Failure Modes` / `## Alternative Approaches` / `## Trade-off Counterarguments`). These hold up better (forceful "exactly four sections, each non-empty" wording), but the same fragility class applies.

Because of this non-determinism, `tests/e2e.sh` currently checks these headings with `verify_substring_warn` (best-effort **warning**, not a hard failure) instead of `verify_substring`. That means a real regression where Codex stops emitting the required structure would pass the Layer 3 e2e silently.

> Context: surfaced during the e2e-skill-invocation-tests cycle (commit `e296d4a`) and re-noted in the config-profile-mechanism cycle (v0.5.0, commit `196be77`). The `verify_substring_warn` helper was introduced specifically as a stopgap for this — see its comment in `tests/e2e.sh`.

## Type

refactor / enhancement (prompt-engineering hardening; no behavior change to the skill's contract, only to how reliably Codex satisfies it)

## Expected

Codex reliably emits the required H2 headings for both producers, so the e2e checks can be promoted from `verify_substring_warn` back to hard `verify_substring` across all skill × scenario combinations.

Likely approach (to be confirmed at diagnose):
- Strengthen review's Step 3 wording to match adversarial-review's forceful pattern ("Your output MUST begin with exactly these two headings, verbatim: `## Summary` then `## Findings`").
- Add an explicit format scaffold / one-shot example block in the system instructions.
- Re-run the full e2e matrix (12 combos) to confirm the headings appear reliably before promoting the assertions.

## Actual

- review Step 3 uses softer "Begin with a one-paragraph Summary … Follow with a Findings list" phrasing — Codex interprets loosely.
- e2e heading checks are `verify_substring_warn` (non-failing) for `mixed` / `binary` / `oversize` / `empty-repo` / `with-profile` scenarios; only `all-empty` (target_invalid pre-flight, written by Claude not Codex) and frontmatter markers are hard-verified.

## Impact

- **Test integrity**: Layer 3 e2e cannot currently catch a regression in the producers' output structure — the headings are advisory. This weakens the e2e gate that exists to catch SKILL→runtime drift.
- **Scope**: `plugins/codex-pro/skills/review/SKILL.md` (Step 3), `plugins/codex-pro/skills/adversarial-review/SKILL.md` (Step 3, lighter touch), `tests/e2e.sh` (promote `verify_substring_warn` → `verify_substring` once reliable).
- **Cost**: verification requires re-running the e2e matrix (real codex-call quota; ~12 combos). Bundle with the next producer change to amortize, or run standalone as a release-gate validation.
- **Priority**: P2 — quality/test-integrity improvement, not a user-facing bug. The skill contract is unchanged; this hardens how reliably Codex honors it.

## Non-Goals

- Not changing the result-file frontmatter contract or the 4-section perspectival structure of adversarial-review.
- Not changing codex-call flags or the profile mechanism (v0.5).

---

## Current Status

**Phase**: diagnosed
**Complexity**: Spectra (route: spectra-discuss)
**Updated**: 2026-06-08

Diagnosis posted (root cause: review Step 3 instructions request "a Summary" / "a Findings list" as prose, never the literal `## Summary` / `## Findings` H2 tokens the contract + e2e expect). Next: `/spectra-discuss` to align open questions before proposal.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

review/adversarial-review Step 3 system-instruction hardening — required H2 headings unreliable in Codex output #1

Problem

Type

Expected

Actual

Impact

Non-Goals

Current Status

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

review/adversarial-review Step 3 system-instruction hardening — required H2 headings unreliable in Codex output #1

Description

Problem

Type

Expected

Actual

Impact

Non-Goals

Current Status

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions