Maps common CI failure signatures to exact replay commands, key artifact paths, and remediation steps.
Bead: bd-1f42.8.9 Policy: docs/testing-policy.md QA Runbook: docs/qa-runbook.md
# 1. Replay all failed suites from a previous E2E run
./scripts/e2e/run_all.sh --rerun-from tests/e2e_results/<ts>/summary.json
# 2. Replay a single suite
cargo test --test <suite_name> -- --nocapture
# 3. Replay a single test function
cargo test --test <suite_name> <test_name> -- --nocapture
# 4. Replay with debug output
RUST_LOG=debug RUST_BACKTRACE=1 cargo test --test <suite_name> -- --nocapture
# 5. Replay CI gate failures
cargo test --test ci_full_suite_gate -- full_suite_gate --nocapture --exact

Signature: Completion audit closeout gate self-test fails, or `scripts/check_completion_audit_gate.py` exits non-zero for a closeout artifact.
Artifacts:
- Completion audit JSON from `scripts/build_completion_audit.py`
- `tests/fixtures/completion_audit_gate/scenarios.json`
- `tests/fixtures/completion_audit_gate/goldens/*.json`
Replay:
python3 scripts/check_completion_audit_gate.py --self-test \
--generated-at 2026-01-02T03:04:05+00:00
python3 scripts/check_completion_audit_gate.py \
  --audit-json docs/evidence/<completion-audit>.json
Remediation:
- Read `blockers[*].kind` and `operator_next_actions` in the gate JSON.
- For `missing_push`, push the closeout commit and regenerate the audit.
- For `failed_command`, rerun the exact command after fixing the failure and include the passing transcript.
- For `proxy_only_evidence`, replace narrative or indirect proof with direct command, artifact, git, or Beads evidence.
- For `missing_artifact`, create or correct the artifact path and rerun the audit.
- For `unresolved_gap`, close the gap or create an active owner bead before closeout.
The gate is intentionally lightweight: it reads existing JSON only. It must not run Cargo, call live providers, mutate Beads, send Agent Mail, launch RCH, or delete files. Any future heavy validation referenced by the audit must use the repo's RCH-backed command form.
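The blocker-to-remediation mapping above can be sketched as a small read-only helper. This is a minimal illustration, assuming the gate JSON has a top-level `blockers` list whose entries carry a `kind` string; that layout and the name `remediation_for` are assumptions, not the script's documented output contract.

```python
import json

# Remediation hints keyed by blocker kind, copied from the list above.
REMEDIATION = {
    "missing_push": "Push the closeout commit and regenerate the audit.",
    "failed_command": "Fix the failure, rerun the exact command, and include the passing transcript.",
    "proxy_only_evidence": "Replace narrative or indirect proof with direct command, artifact, git, or Beads evidence.",
    "missing_artifact": "Create or correct the artifact path and rerun the audit.",
    "unresolved_gap": "Close the gap or create an active owner bead before closeout.",
}

FALLBACK = "No documented remediation; inspect operator_next_actions."


def remediation_for(gate_json: str) -> list[str]:
    """Return one hint per blocker, in order. Parses JSON only; no side effects."""
    gate = json.loads(gate_json)
    hints = []
    for blocker in gate.get("blockers", []):  # assumed layout: [{"kind": "..."}]
        kind = blocker.get("kind", "unknown")
        hints.append(f"{kind}: {REMEDIATION.get(kind, FALLBACK)}")
    return hints
```

Like the gate itself, the helper only parses JSON that already exists, so it stays within the lightweight constraint described above.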
Signature: non_mock_compliance_gate ... FAILED
Artifacts:
- `docs/non-mock-rubric.json` (rubric thresholds)
- `docs/test_double_inventory.json` (current inventory)
Replay:
cargo test --test non_mock_compliance_gate -- --nocapture
Remediation:
- Check which module fell below its floor threshold.
- Review `docs/non-mock-rubric.json` for the affected module's floor values.
- Migrate mock/stub usages to VCR or real implementations.
- See `docs/testing-policy.md` "Allowlisted Exceptions" for the approval process.
Signature: conformance_must_pass_gate ... FAILED
Artifacts:
- `tests/ext_conformance/reports/gate/must_pass_gate_verdict.json`
- `tests/ext_conformance/reports/conformance_summary.json`
Replay:
cargo test --test ext_conformance_generated --features ext-conformance \
  -- conformance_must_pass_gate --nocapture --exact
Remediation:
- Check `conformance_summary.json` for pass/fail/N/A counts.
- Look for newly failing extensions in the summary.
- Common causes: missing node shim, new hostcall not dispatched, QuickJS module resolution.
- See `docs/conformance-operator-playbook.md` for debugging workflows.
Signature: cross_platform_matrix ... FAILED
Artifacts:
tests/cross_platform_reports/linux/platform_report.json
Replay:
cargo test --test ci_cross_platform_matrix -- cross_platform_matrix --nocapture --exact
Remediation:
- Read the platform report to identify which checks failed.
- Common causes: missing system dependencies, path separator issues, permission differences.
- Fix the platform-specific code and re-run.
Signature: build_evidence_bundle ... FAILED
Artifacts:
tests/evidence_bundle/index.json
Replay:
cargo test --test ci_evidence_bundle -- build_evidence_bundle --nocapture --exact
Remediation:
- The evidence bundle validates that all required artifacts exist and are well-formed.
- Check for missing artifact files (`summary.json`, `environment.json`, etc.).
- Ensure `scripts/e2e/run_all.sh` completed all post-run phases.
Signature: certification/readiness checks fail on missing
extension_remediation_backlog.json or schema mismatch.
Artifacts:
- `tests/full_suite_gate/certification_dossier.json`
- `tests/full_suite_gate/extension_remediation_backlog.json`
- `tests/full_suite_gate/extension_remediation_backlog.md`
Replay:
cargo test --test qa_certification_dossier -- certification_dossier --nocapture --exact
Remediation:
- Regenerate certification artifacts and backlog in a single run (command above).
- Verify backlog schema is `pi.qa.extension_remediation_backlog.v1`.
- Ensure the backlog summary/entries are non-empty when conformance failures exist.
- Re-run dependent gates after artifact refresh.
Signature: suite_classification gate fails
Artifacts:
tests/suite_classification.toml
Replay:
cargo test --test ci_full_suite_gate -- full_suite_gate --nocapture --exact
Remediation:
- Typical cause: a new test file in `tests/` is not listed in `tests/suite_classification.toml`.
- Classify the file into `[suite.unit]`, `[suite.vcr]`, or `[suite.e2e]`.
- Keep entries sorted alphabetically within each suite.
Signature: waiver_lifecycle_audit ... FAILED or waiver_lifecycle gate fails
Artifacts:
- `tests/full_suite_gate/waiver_audit.json`
- `tests/suite_classification.toml` (waiver entries)
Replay:
cargo test --test ci_full_suite_gate -- waiver_lifecycle_audit --nocapture --exact
Remediation:
- Check `waiver_audit.json` for expired or invalid waivers.
- Expired waivers must be either renewed (new `expires` date, max +30 days) or removed.
- Invalid waivers are missing required fields; add all 7 fields.
- See `docs/qa-runbook.md` "Waiver Lifecycle" for the full schema.
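The expiry and renewal rules above can be illustrated with a short date check. This is a sketch under two assumptions that the runbook does not state: `expires` is an ISO `YYYY-MM-DD` date, and the 30-day cap is measured from the day of renewal. The function names are hypothetical.

```python
from datetime import date, timedelta

MAX_RENEWAL_DAYS = 30  # "max +30 days" from the remediation list above


def is_expired(expires: str, today: date) -> bool:
    """Treat a waiver as expired once its `expires` date is strictly in the past."""
    return date.fromisoformat(expires) < today


def renewal_is_valid(new_expires: str, today: date) -> bool:
    """Accept a renewed date only if it is not in the past and within the 30-day cap."""
    new_date = date.fromisoformat(new_expires)
    return today <= new_date <= today + timedelta(days=MAX_RENEWAL_DAYS)
```

For example, with `today = date(2026, 1, 10)` a renewal to `2026-02-09` is accepted and `2026-02-10` is rejected. The authoritative rules remain those in `docs/qa-runbook.md`.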
Signature: provider_streaming or e2e_provider_streaming test failures
Artifacts:
tests/fixtures/vcr/(VCR cassettes)
Replay:
# VCR-backed
VCR_MODE=playback cargo test --test provider_streaming -- --nocapture
# E2E
cargo test --test e2e_provider_streaming -- --nocapture
Remediation:
- Check if VCR cassettes are stale (model IDs changed, API format updated).
- Verify `api_key: Some("vcr-playback".to_string())` in `StreamOptions`.
- For URL mismatches: VCR uses strict URL matching; ensure model ID in test matches cassette.
Signature: e2e_tui tests fail
Artifacts:
- E2E results directory
Replay:
cargo test --test e2e_tui -- --nocapture
Remediation:
- TUI tests require tmux. Verify `tmux` is installed and accessible.
- Set `PI_TEST_MODE=1` for deterministic rendering.
- VCR cassettes provide provider responses; check cassette freshness.
Signature: Inconsistent pass/fail across runs on the same commit.
Replay:
# Run with same parallelism as CI
cargo test --test <suite> -- --nocapture --test-threads=1
# Multiple runs to detect flakiness
for i in $(seq 1 5); do
  cargo test --test <suite> -- <test_name> --exact --nocapture || echo "FAIL on run $i"
done
Remediation:
- Classify the flake per taxonomy (FLAKE-TIMING/ENV/NET/RES/EXT/LOGIC).
- Add quarantine entry to `tests/suite_classification.toml`.
- See `docs/testing-policy.md` "Flaky-Test Quarantine" for the full lifecycle.
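When the shell loop's output is hard to tally, the same repeat-and-count idea can be sketched in Python. The helper name `count_failures` is hypothetical; it relies only on the standard `subprocess` module and treats any non-zero exit code as a failed run.

```python
import subprocess
import sys


def count_failures(cmd: list[str], runs: int = 5) -> list[int]:
    """Run `cmd` `runs` times and return the 1-based run numbers that failed."""
    failed = []
    for i in range(1, runs + 1):
        # Output is captured so only the pass/fail tally is reported.
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failed.append(i)
    return failed
```

A call such as `count_failures(["cargo", "test", "--test", suite, "--", test_name, "--exact"])` returns an empty list when every run passes. A non-empty list that is shorter than `runs` points to a flake rather than a hard failure.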
This section defines the operator workflow for parity regressions that threaten strict drop-in claims.
- `tests/e2e_results/<ts>/triage_diff.json` has `status = "regression"` or `summary.regression_count > 0`.
- `tests/full_suite_gate/full_suite_verdict.json` shows a failed blocking gate affecting parity/test-log evidence (`e2e_log_contract`, `suite_classification`, `conformance_pass_rate`, `evidence_bundle`, or other blocking gate).
- `docs/evidence/dropin-certification-verdict.json` is missing or has `overall_verdict != CERTIFIED` when release messaging needs strict drop-in wording.
- CI parity suite gate fails (`PARITY GATE FAIL`) in `.github/workflows/ci.yml`.
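The first and third signals above reduce to simple predicates over already-parsed JSON. The sketch below assumes the field layout implied by the dotted names (`status`, `summary.regression_count`, `overall_verdict`); the function names are hypothetical.

```python
from typing import Optional


def is_parity_regression(triage_diff: dict) -> bool:
    """True when status is "regression" or summary.regression_count > 0."""
    status = triage_diff.get("status")
    count = triage_diff.get("summary", {}).get("regression_count", 0)
    return status == "regression" or count > 0


def blocks_strict_dropin_wording(verdict: Optional[dict]) -> bool:
    """True when the verdict file is missing (None) or is not CERTIFIED."""
    if verdict is None:
        return True
    return verdict.get("overall_verdict") != "CERTIFIED"
```

Both checks fail closed: a missing verdict or a non-zero regression count is treated as an incident signal.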
| Severity | Criteria | Response target |
|---|---|---|
| `SEV-1` | Blocking parity regression on main or release cut path | Assign owner + post incident context within 30 minutes |
| `SEV-2` | New regression in PR/branch with no current release block | Assign owner + post context within 4 hours |
| `SEV-3` | Evidence/documentation drift without active behavior regression | Assign owner + post context within 1 business day |
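The response targets in the table can be turned into concrete deadlines. This is a sketch only: it assumes "1 business day" means the same clock time on the next weekday and ignores holidays, which the table does not specify.

```python
from datetime import datetime, timedelta


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Map a severity from the table above to a response deadline."""
    if severity == "SEV-1":
        return detected_at + timedelta(minutes=30)
    if severity == "SEV-2":
        return detected_at + timedelta(hours=4)
    if severity == "SEV-3":
        # Assumption: one business day = same time on the next weekday.
        deadline = detected_at + timedelta(days=1)
        while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            deadline += timedelta(days=1)
        return deadline
    raise ValueError(f"unknown severity: {severity}")
```

Under that assumption, a `SEV-3` detected on a Friday morning is due at the same time the following Monday.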
Collect and attach these artifacts to the incident bead and Agent Mail thread:
- `tests/e2e_results/<ts>/summary.json`
- `tests/e2e_results/<ts>/triage_diff.json`
- `tests/e2e_results/<ts>/replay_bundle.json`
- `tests/e2e_results/<ts>/failure_diagnostics_index.json`
- `tests/full_suite_gate/full_suite_verdict.json`
- `tests/full_suite_gate/full_suite_events.jsonl`
- `tests/full_suite_gate/full_suite_report.md`
- `tests/evidence_bundle/index.json`
- `docs/contracts/dropin-certification-contract.json`
- `docs/evidence/dropin-certification-verdict.json` (if present in the run)
- Capture a reproducible baseline diff:
./scripts/e2e/run_all.sh --profile ci \
  --diff-from tests/e2e_results/<baseline-ts>/summary.json
- Run gate replay commands for failing lanes:
cargo test --test ci_full_suite_gate -- full_suite_gate --nocapture --exact
cargo test --test ci_full_suite_gate -- preflight_fast_fail --nocapture --exact
cargo test --test ci_full_suite_gate -- full_certification --nocapture --exact
- Extract exact per-gate remediation commands from the verdict:
python3 - <<'PY'
import json
from pathlib import Path

p = Path("tests/full_suite_gate/full_suite_verdict.json")
if not p.exists():
    raise SystemExit("missing full_suite_verdict.json")
data = json.loads(p.read_text(encoding="utf-8"))
for gate in data.get("gates", []):
    if gate.get("status") == "fail":
        print(f"{gate['id']}: {gate.get('reproduce_command', 'N/A')}")
PY
- Create/update the owning bead and notify the swarm in-thread (`thread_id` = bead id) with: failing gate IDs, `triage_diff.status`, top `ranked_diagnostics`, and one-command replay.
- Apply fix and rerun:
./scripts/e2e/run_all.sh --rerun-from tests/e2e_results/<ts>/summary.json
cargo test --test ci_full_suite_gate -- full_suite_gate --nocapture --exact
- Close only when all exit criteria are true:
  - `triage_diff.status` is not `regression`.
  - Blocking full-suite gates pass.
  - Drop-in wording guard is satisfied (`overall_verdict = CERTIFIED`) for release claims.
  - Bead + Agent Mail thread contain artifact links and final remediation note.

- If unresolved beyond response target: escalate to maintainer in the same bead thread.
- If release train is active and `SEV-1` persists: freeze strict drop-in messaging until parity incident is closed.
- Use rollback mode (`CI_GATE_PROMOTION_MODE=rollback`) only as a short-lived emergency control; record rationale + expiry in the incident bead and restore `strict` after fix.
When the incident affects performance certification (not only drop-in wording), also apply this fail-closed checklist:
- Treat missing/stale PERF-3X artifacts as blocking failures:
  - `tests/full_suite_gate/perf3x_bead_coverage_audit.json`
  - `tests/full_suite_gate/practical_finish_checkpoint.json`
  - `tests/perf/reports/budget_summary.json`
  - `tests/perf/reports/perf_comparison.json`
  - `tests/perf/reports/stress_triage.json`
  - `tests/perf/reports/parameter_sweeps.json`
- Attach `tests/full_suite_gate/certification_events.jsonl` plus perf event streams:
  - `tests/perf/reports/budget_events.jsonl`
  - `tests/perf/reports/perf_comparison_events.jsonl`
  - `tests/perf/reports/stress_events.jsonl`
  - `tests/perf/reports/parameter_sweeps_events.jsonl`
- Use the log-query playbooks in `docs/qa-runbook.md` under PERF-3X Regression Triage (bd-3ar8v.6.4) for attribution and replay targeting.
- Do not close the incident until detection, attribution, mitigation, and verification are all recorded in the bead thread with artifact links.
Signature: full_suite_verdict.json contains gate parameter_sweeps_integrity with
status = "fail" and detail mentioning parameter_sweeps.* schema/readiness/source contract drift.
Artifacts:
- `tests/perf/reports/parameter_sweeps.json`
- `tests/perf/reports/parameter_sweeps_events.jsonl`
- `tests/perf/reports/phase1_matrix_validation.json`
- `tests/full_suite_gate/full_suite_verdict.json`
Replay:
rch exec -- cargo test --test release_evidence_gate -- \
parameter_sweeps_contract_links_phase1_matrix_and_readiness --nocapture --exact
rch exec -- cargo test --test ci_full_suite_gate -- full_suite_gate --nocapture --exact
Remediation:
- Enforce artifact schema `pi.perf.parameter_sweeps.v1`.
- Enforce `source_identity` contract (`source_artifact = "phase1_matrix_validation"` and `source_artifact_path` references `phase1_matrix_validation.json`).
- Enforce readiness invariants:
  - `status = ready` -> `ready_for_phase5 = true` and `blocking_reasons = []`
  - `status = blocked` -> `ready_for_phase5 = false` and non-empty `blocking_reasons`
- Ensure `selected_defaults` are positive integers and `sweep_plan.dimensions` includes required knobs.
- Re-run full-suite gate and re-attach updated `parameter_sweeps` artifact + event stream.
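The schema and readiness invariants above can be checked with a small validator before re-running the gate. This is a sketch under assumptions: the top-level `schema` key and the `readiness` nesting are guesses at the artifact layout, not the gate's actual parser.

```python
EXPECTED_SCHEMA = "pi.perf.parameter_sweeps.v1"


def readiness_violations(artifact: dict) -> list[str]:
    """Return human-readable violations of the readiness invariants."""
    problems = []
    if artifact.get("schema") != EXPECTED_SCHEMA:  # assumed key name
        problems.append(f"schema must be {EXPECTED_SCHEMA}")
    readiness = artifact.get("readiness", {})  # assumed nesting
    status = readiness.get("status")
    ready = readiness.get("ready_for_phase5")
    reasons = readiness.get("blocking_reasons")
    if status == "ready":
        if ready is not True or reasons != []:
            problems.append("ready requires ready_for_phase5 = true and blocking_reasons = []")
    elif status == "blocked":
        if ready is not False or not reasons:
            problems.append("blocked requires ready_for_phase5 = false and non-empty blocking_reasons")
    else:
        problems.append(f"unexpected readiness status: {status!r}")
    return problems
```

An empty result means both invariants hold. Any other status value is reported rather than silently accepted, which keeps the check fail-closed.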
Signature: gate practical_finish_checkpoint fails with detail like
technical PERF-3X issue(s) still open or Fail-closed practical-finish source read error.
Artifacts:
- `tests/full_suite_gate/practical_finish_checkpoint.json`
- `.beads/issues.jsonl` (or fallback `.beads/beads.base.jsonl`)
- `tests/full_suite_gate/full_suite_verdict.json`
- `tests/full_suite_gate/certification_events.jsonl`
Replay:
rch exec -- cargo test --test ci_full_suite_gate -- \
practical_finish_report_fails_when_technical_open_issues_remain --nocapture --exact
rch exec -- cargo test --test release_readiness -- practical_finish_checkpoint_ -- --nocapture
rch exec -- cargo test --test ci_full_suite_gate -- full_suite_gate --nocapture --exact
Remediation:
- Verify `practical_finish_checkpoint.json` schema is `pi.perf3x.practical_finish_checkpoint.v1`.
- Ensure required contract fields are coherent: `status`, non-empty `detail`, `technical_completion_reached`, `residual_open_scope`, and count equality (`open_perf3x_count = technical_open_count + docs_or_report_open_count`).
- Close or re-scope remaining technical PERF-3X issues; only docs/report residuals are allowed.
- Re-run full-suite gate and attach refreshed checkpoint artifact + certification events before closure.
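The count equality and the "only docs/report residuals" rule can be expressed directly. The sketch below assumes the three counters are top-level integer fields of the checkpoint JSON; the function names are hypothetical.

```python
def count_equality_holds(checkpoint: dict) -> bool:
    """open_perf3x_count must equal technical_open_count + docs_or_report_open_count."""
    return checkpoint["open_perf3x_count"] == (
        checkpoint["technical_open_count"] + checkpoint["docs_or_report_open_count"]
    )


def only_docs_residuals_remain(checkpoint: dict) -> bool:
    """True when the counts are coherent and no technical issues are still open."""
    return count_equality_holds(checkpoint) and checkpoint["technical_open_count"] == 0
```

For example, `{"open_perf3x_count": 2, "technical_open_count": 0, "docs_or_report_open_count": 2}` satisfies both checks, while any non-zero `technical_open_count` fails the second.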
Signature: claim-contract validation reports tier-order drift (or missing canonical sequence) for:
- `TIER-1-EXTENSION-HOST-PARITY`
- `TIER-2-TARGETED-RUNTIME-PARITY`
- `TIER-3-FULL-NODE-BUN-REPLACEMENT`
Artifacts:
- `docs/franken-node-claim-gating-contract.json`
- `tests/full_suite_gate/franken_node_claim_verdict.json`
- `tests/full_suite_gate/practical_finish_checkpoint.json`
Replay:
rch exec -- cargo test --test franken_node_claim_contract -- \
franken_node_claim_contract_declares_expected_tier_order -- --nocapture
rch exec -- cargo test --test release_evidence_gate -- \
  franken_node_claim_contract_is_present_and_valid --nocapture --exact
Remediation:
- Restore canonical tier order in `claim_tiers` to Tier-1 -> Tier-2 -> Tier-3.
- Ensure every tier still carries non-empty `required_evidence`, `allowed_claim_language`, and `forbidden_claim_language`.
- Keep strict replacement gating fail-closed (`overall_verdict = CERTIFIED` required) and regenerate `franken_node_claim_verdict.json` before incident closure.
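A quick local check of the tier order and per-tier fields might look like the sketch below. The key `tier_id` is an assumed name for the tier identifier inside each `claim_tiers` entry; the contract file may use a different key.

```python
CANONICAL_TIERS = [
    "TIER-1-EXTENSION-HOST-PARITY",
    "TIER-2-TARGETED-RUNTIME-PARITY",
    "TIER-3-FULL-NODE-BUN-REPLACEMENT",
]
REQUIRED_KEYS = ("required_evidence", "allowed_claim_language", "forbidden_claim_language")


def tier_contract_problems(claim_tiers: list[dict]) -> list[str]:
    """Report tier-order drift and empty required fields."""
    problems = []
    ids = [tier.get("tier_id") for tier in claim_tiers]  # "tier_id" is an assumed key
    if ids != CANONICAL_TIERS:
        problems.append(f"tier order drift: {ids}")
    for tier in claim_tiers:
        for key in REQUIRED_KEYS:
            if not tier.get(key):  # missing, empty list, or empty string
                problems.append(f"{tier.get('tier_id')}: empty {key}")
    return problems
```

The replay commands above remain the authoritative check; this only helps narrow down which tier entry drifted before re-running them.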
Signature: kernel-extraction boundary contract/report drift is detected in manifest validation output, especially missing module ownership coverage, duplicate ownership, or banned cross-boundary pair regressions.
Artifacts:
- `docs/franken-node-kernel-extraction-boundary-manifest.json`
- `tests/full_suite_gate/franken_node_kernel_boundary_drift_report.json`
- `tests/full_suite_gate/practical_finish_checkpoint.json`
Replay:
rch exec -- cargo test --test franken_node_kernel_extraction_boundary_manifest -- \
kernel_boundary_manifest_ -- --nocapture
rch exec -- cargo test --test qa_docs_policy_validation -- \
  franken_node_mission_contract_tier_mapping_declares_required_checks_and_phase6_beads -- --nocapture
Remediation:
- Ensure drift report checks remain present and fail-closed: `kernel_boundary.all_modules_mapped_or_deferred`, `kernel_boundary.no_duplicate_domain_ownership`, and `kernel_boundary.banned_cross_boundary_pairs_absent`.
- Restore strict tier evidence linkage tokens in mission contract: `docs/franken-node-kernel-extraction-boundary-manifest.json` and `tests/full_suite_gate/franken_node_kernel_boundary_drift_report.json`.
- Re-run the replay commands and attach refreshed artifacts before clearing the incident.
Signature: semantic compatibility harness fails or hard-skips because a real
Node runtime is unavailable, or Bun's node shim is incorrectly treated as
Node. Typical signals include Node.js not found and
SKIP: generate_compatibility_matrix requires both Node.js and Bun.
Artifacts:
- `tests/franken_node_compat_harness.rs`
- `tests/franken_node_compat/fixtures/`
- `tests/full_suite_gate/full_suite_verdict.json`
Replay:
rch exec -- cargo test --test franken_node_compat_harness -- \
node_detection_rejects_bun_node_shim_when_present -- --nocapture
rch exec -- cargo test --test franken_node_compat_harness -- \
  generate_compatibility_matrix -- --nocapture
Remediation:
- Keep `find_node()` and `is_real_node()` aligned with fail-closed detection: Bun's `/home/ubuntu/.bun/bin/node` shim must not pass as real Node.
- Preserve deterministic skip diagnostics when Node/Bun are unavailable: `SKIP: Node.js not found on this machine`, `SKIP: Bun not found on this machine`, and `SKIP: generate_compatibility_matrix requires both Node.js and Bun`.
- After runtime availability is corrected, re-run harness replay commands and attach refreshed verdict artifacts before clearing the incident.
The primary run summary. Key fields:
| Field | Meaning |
|---|---|
| `failed_names` | List of failed E2E suite names |
| `failed_unit_names` | List of failed unit target names |
| `passed_suites` / `total_suites` | E2E suite pass rate |
| `replay_bundle.one_command_replay` | Single command that replays all failures |
| `triage_diff` | Baseline comparison (if `--diff-from` was used) |
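A short reader for these fields can speed up triage. The sketch below assumes the dotted name `replay_bundle.one_command_replay` denotes a nested object in `summary.json`; the function name and output keys are hypothetical.

```python
import json


def failure_overview(summary_text: str) -> dict:
    """Pull the triage-relevant fields out of a summary.json payload."""
    summary = json.loads(summary_text)
    replay = summary.get("replay_bundle", {}).get("one_command_replay")
    return {
        "failed_suites": summary.get("failed_names", []),
        "failed_unit_targets": summary.get("failed_unit_names", []),
        "pass_rate": f"{summary.get('passed_suites', 0)}/{summary.get('total_suites', 0)}",
        "replay": replay,
    }
```

Typical use is `failure_overview(Path("tests/e2e_results/<ts>/summary.json").read_text(encoding="utf-8"))` with `<ts>` replaced by the run's timestamp directory and `Path` imported from `pathlib`.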
Consolidated replay commands and environment context:
| Field | Meaning |
|---|---|
| `one_command_replay` | Single command to reproduce all failures |
| `environment.profile` | Run profile (quick/focused/ci/full) |
| `environment.vcr_mode` | VCR mode during the run |
| `environment.git_sha` | Git commit of the run |
| `failed_suites[].cargo_replay` | Per-suite cargo test command |
| `failed_suites[].targeted_replay` | Single-test cargo command |
| `failed_suites[].digest_path` | Path to per-suite failure digest |
Per-suite failure analysis:
| Field | Meaning |
|---|---|
| `root_cause_class` | Classification: `assertion_failure`, `timeout`, `panic`, etc. |
| `impacted_scenario_ids` | List of failed test names |
| `first_failing_assertion` | Location and message of first failure |
| `remediation_pointer.replay_command` | Runner-level replay |
| `remediation_pointer.suite_replay_command` | Suite-level cargo test |
| `remediation_pointer.targeted_test_replay_command` | Single-test cargo test |
Baseline comparison for regressions:
| Field | Meaning |
|---|---|
| `status` | `regression`, `stable`, or `known_failures_only` |
| `summary.regression_count` | New failures vs baseline |
| `ranked_diagnostics` | Severity-ranked list of changes |
| `recommended_commands.runner_repro_command` | Replay all problem targets |
| `recommended_commands.ranked_repro_commands` | Prioritized per-target commands |
The CI runner supports sharding for parallel execution:
# Run shard 0 of 3 for E2E suites
./scripts/e2e/run_all.sh --profile ci --shard-kind suite --shard-index 0 --shard-total 3
# Run shard 1 of 4 for unit targets
./scripts/e2e/run_all.sh --profile ci --shard-kind unit --shard-index 1 --shard-total 4
Shard context is captured in:
- `environment.json`: `shard.kind`, `shard.index`, `shard.total`
- `summary.json`: same shard fields
- `replay_bundle.json`: `environment.shard_kind`, `shard_index`, `shard_total`

To replay a specific shard's failures, use the `--rerun-from` flag with that shard's `summary.json`.
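To reason about which targets a given `--shard-index` / `--shard-total` pair would cover, a deterministic partition can be sketched as below. Round-robin over the sorted name list is an assumption chosen for illustration; this runbook does not describe how `run_all.sh` actually assigns targets to shards.

```python
def shard_members(names: list[str], shard_index: int, shard_total: int) -> list[str]:
    """Return the names assigned to one shard (round-robin over sorted names)."""
    if shard_total < 1 or not 0 <= shard_index < shard_total:
        raise ValueError("shard_index must satisfy 0 <= index < total")
    ordered = sorted(names)  # sorting keeps the split stable across machines
    return [name for i, name in enumerate(ordered) if i % shard_total == shard_index]
```

Whatever scheme the runner uses, the useful property is the same: every target lands in exactly one shard, so the union of all shard summaries covers the full run.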