
feat(search): EXP-14 — retrieval-side abstention gate #8

Draft
moralespanitz wants to merge 3 commits into main from experiment/exp-14-retrieval-confidence-gate

Conversation

@moralespanitz

EXP-14 — Retrieval-side abstention gate

Target ability: ABS (Abstention)
Status: Implementation complete, tests pass, ready for experimentation
Plan: experiments/exp-14-implementation-plan-2026-04-29.md

What

Adds a post-rerank confidence computation that signals when retrieval results are poorly separated and/or absolutely weak. The BEAM adapter can read this signal and append an abstention instruction block to the answer prompt.

Why

ABS scored 0/2 in the Stage 7 dry-run. Honcho's ABS (36.3) is below the no-memory baseline (60.0) — the rare BEAM ability where memory retrieval actively hurts. EXP-14 is the only Phase-2 category where success lands above both Honcho and baseline.

How

  • Gate on similarity, not score — score is rewritten by RRF, cross-encoder, MMR, and boosts mid-pipeline. similarity is the only stable, scale-invariant signal.
  • Inserted between applyRankingProtectionStages and selectAndExpandCandidates in search-pipeline.ts — after cross-encoder (top is final), before MMR (which would artificially lower top-1).
  • Four new RuntimeConfig fields, all default-off:
    • retrievalConfidenceGateEnabled
    • retrievalConfidenceMarginNormalizer (default 0.05)
    • retrievalConfidenceSimilarityNormalizer (default 0.5)
    • retrievalConfidenceFloor (default 0.3)
  • All four are allowlisted in INTERNAL_POLICY_CONFIG_FIELDS so the BEAM adapter A/Bs per-ability via config_override.
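A minimal sketch of the gate's core computation, assuming the config fields above; the function shape, field names, and exact blending formula are illustrative guesses, not the actual code in retrieval-confidence-gate.ts:

```typescript
// Hypothetical sketch: blend separation (top-1 vs top-2 margin) with
// absolute strength (top-1 similarity), each scaled by its normalizer.
// Low confidence fires when the blended score falls below the floor.
interface RetrievalConfidenceConfig {
  enabled: boolean;
  marginNormalizer: number;     // default 0.05
  similarityNormalizer: number; // default 0.5
  floor: number;                // default 0.3
}

interface RetrievalConfidence {
  topSimilarity: number;
  margin: number;
  confidence: number;
  lowConfidence: boolean;
}

function computeRetrievalConfidence(
  similarities: number[], // post-rerank, sorted descending
  cfg: RetrievalConfidenceConfig,
): RetrievalConfidence | undefined {
  if (!cfg.enabled || similarities.length === 0) return undefined;
  const top = similarities[0];
  const second = similarities[1] ?? 0;
  const margin = top - second;
  // Clamp each component to [0, 1] before averaging.
  const marginScore = Math.min(1, margin / cfg.marginNormalizer);
  const strengthScore = Math.min(1, top / cfg.similarityNormalizer);
  const confidence = (marginScore + strengthScore) / 2;
  return {
    topSimilarity: top,
    margin,
    confidence,
    lowConfidence: confidence < cfg.floor,
  };
}
```

With the defaults, a well-separated, strong top hit (e.g. 0.9 vs 0.5) saturates both components, while a narrow and weak result set (e.g. 0.10 vs 0.09) lands below the 0.3 floor and trips the gate.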

Files changed

  • src/services/retrieval-confidence-gate.ts — New: confidence computation module (~70 LOC)
  • src/services/search-pipeline.ts — Add fields to SearchPipelineRuntimeConfig, invoke gate, change return type, emit trace event
  • src/services/memory-search.ts — Thread retrievalConfidence through to RetrievalResult
  • src/services/memory-service-types.ts — Add retrievalConfidence?: RetrievalConfidence to RetrievalResult
  • src/routes/memories.ts — Emit retrieval_confidence in formatSearchResponse
  • src/config.ts — Add 4 fields to RuntimeConfig, env loaders, INTERNAL_POLICY_CONFIG_FIELDS
  • src/app/runtime-container.ts — Add 4 fields to CoreRuntimeConfig
  • src/services/__tests__/retrieval-confidence-gate.test.ts — New: 10 unit tests

Division of labor

  • This PR (core): confidence computation, tracing, response field
  • Follow-up PR (benchmarks): BEAM adapter reads retrieval_confidence.low_confidence and appends abstention prompt block

Tests

  • 10 unit tests, all passing
  • npx tsc --noEmit clean

Rollback

Set retrievalConfidenceGateEnabled: false (default). Server emits no retrieval_confidence; adapter sees undefined; no prompt block appended. Bit-identical to pre-EXP-14.
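The follow-up adapter logic this rollback path relies on can be sketched as below; the abstention wording and everything beyond retrieval_confidence.low_confidence are assumptions for illustration:

```typescript
// Hypothetical sketch of the BEAM adapter's prompt assembly. When the
// gate is disabled, retrieval_confidence is absent from the response,
// the optional chain yields undefined, and the prompt is bit-identical
// to pre-EXP-14 behavior.
interface SearchResponse {
  retrieval_confidence?: { low_confidence: boolean };
}

// Placeholder wording; the real abstention block lives in the follow-up PR.
const ABSTENTION_BLOCK =
  "If the retrieved memories do not contain the answer, say you don't know.";

function buildAnswerPrompt(base: string, resp: SearchResponse): string {
  if (resp.retrieval_confidence?.low_confidence) {
    return `${base}\n\n${ABSTENTION_BLOCK}`;
  }
  return base;
}
```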

The extraction LLM was truncating JSON output at ~14 KB during BEAM Sprint 2
CR mini-slice runs on dense 10-turn chunks. The server log showed:

  [extractFacts] JSON parse failed (Unterminated string in JSON at
  position 14152 ...); attempting repair

across 6 chunks of one ingest, causing iter 7 (first attempt) to crash
on conv-3.

The Anthropic max_tokens budget defaults to 4096 in extraction.ts. Going
to 8192 doubles the headroom for JSON output without changing any other
behavior. Cost impact is marginal (Anthropic bills only for tokens
actually generated; rare for extraction to use the full 8192).

Validation: server is running with this change locally; iter 7 v3 N=3
full-ingest reruns succeed without truncation.

Companion harness mitigation lowered chunk size from 10 to 5 turn-pairs
(in atomicmemory-benchmarks PR #8) to reduce the chance of hitting the
limit at all. This server-side bump is defense-in-depth.
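The max_tokens bump above amounts to a one-constant change; the surrounding call shape is an assumption based on the Anthropic TypeScript SDK, not the actual contents of extraction.ts:

```typescript
// Hypothetical sketch of the extraction.ts change.
// Before: const EXTRACTION_MAX_TOKENS = 4096;
const EXTRACTION_MAX_TOKENS = 8192; // doubles headroom for JSON output

// The budget passes straight through to the Messages API call
// (names below are illustrative):
//
//   const response = await client.messages.create({
//     model: EXTRACTION_MODEL,
//     max_tokens: EXTRACTION_MAX_TOKENS,
//     messages: [{ role: "user", content: extractionPrompt }],
//   });
//
// Since Anthropic bills only for tokens actually generated, raising the
// ceiling is nearly free when extraction output stays small.
```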

A subagent that was supposed to write a plan-only doc also produced
preliminary code on this branch. Preserving it for later — the autoresearch
loop will treat it as a future iteration candidate. NOT verified, NOT
ready for review.

Adds a post-rerank confidence computation that signals when retrieval
results are poorly separated and/or absolutely weak. Targets the BEAM
abstention (ABS) ability, where Honcho scores below baseline.

- New retrieval-confidence-gate.ts (~70 LOC) with computeRetrievalConfidence
- Gates on similarity (stable, scale-invariant) not score (rewritten by RRF)
- Four new RuntimeConfig fields, all default-off, allowlisted in
  INTERNAL_POLICY_CONFIG_FIELDS for config_override A/B testing
- Threaded through search-pipeline → memory-search → routes → response
- Emits retrieval_confidence JSON in search responses when enabled
- Trace event 'low-confidence-gate' fires when low confidence detected
- 10 unit tests covering: disabled, empty, strong separation, narrow+
  weak, strong-margin override, normalizer/floor overrides

Plan: experiments/exp-14-implementation-plan-2026-04-29.md
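The retrieval_confidence field emitted in search responses might look like the sketch below; only low_confidence is named in this PR, so the other fields are illustrative assumptions:

```typescript
// Hypothetical shape of the retrieval_confidence payload emitted by
// formatSearchResponse when the gate is enabled. Only low_confidence is
// confirmed by the PR text; the numeric fields are guesses.
interface RetrievalConfidencePayload {
  low_confidence: boolean; // read by the BEAM adapter in the follow-up PR
  confidence: number;      // blended margin/strength score (assumed)
  top_similarity: number;  // post-rerank top-1 similarity (assumed)
}

// Example of what an enabled, low-confidence response might carry:
const example: { retrieval_confidence?: RetrievalConfidencePayload } = {
  retrieval_confidence: {
    low_confidence: true,
    confidence: 0.2,
    top_similarity: 0.12,
  },
};
```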
@moralespanitz moralespanitz requested a review from ethanj as a code owner April 29, 2026 21:01
@moralespanitz moralespanitz marked this pull request as draft April 30, 2026 05:19