Skip to content

feat(context): memory consolidation engine (#498, #679, #680, #681, #682, #683) #708

Merged
dgenio merged 5 commits into main from claude/issue-triage-grouping-5pw848 on Jun 18, 2026

@dgenio dgenio commented Jun 17, 2026

Owner

Summary

Implements the memory consolidation engine epic as one coherent PR: distill
episodic memory into durable, deduplicated, provenance-stamped facts. The epic
(#498) and its five children are delivered together because they share one
module surface, one set of data types, and one implementation path.

Closes #498. Closes #679. Closes #680. Closes #681. Closes #682. Closes #683.

What changed

  • context/consolidation.py — deterministic engine:
    cluster_episodes (#679), promote_clusters with provenance + max-sensitivity
    inheritance (#680), report-only decay_episodes / decay_facts (#681), and the
    consolidate() orchestrator (idempotent apply via content-addressed fact IDs).
  • context/_consolidation_helpers.py — private deterministic helpers
    (canonical text, max-sensitivity, session counting, ISO parsing, content
    fact-IDs, decay predicate); keeps consolidation.py ≤300 lines.
  • context/_consolidation_merge.py (Optional model-assisted consolidation adapter #682) — optional, fail-closed call_fn
    canonicalizer: rejects blank / ungrounded completions (any token absent from
    the source cluster) and falls back to the deterministic text. No LLM SDK dep;
    disabled under deterministic=True.
  • context/consolidation_types.py: ConsolidationPolicy,
    EpisodeCluster, PromotedFact, ConsolidationReport with to_dict /
    from_dict.
  • eval/consolidation.py (Consolidation quality evaluation harness #683) — evaluate_consolidation
    → ConsolidationEvalReport (precision / coverage vs an optional gold set +
    dedup ratio); offline, deterministic.
  • CLI: contextweaver consolidate runs the pipeline over JSON-serialised
    stores (--apply / --facts-out / decay + threshold flags).
  • Re-exports via contextweaver.context and contextweaver.eval; regenerated
    api/public_api.txt; AGENTS.md module map + Key Types; CHANGELOG.md.
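
A minimal sketch of the fail-closed canonicalizer idea described for _consolidation_merge.py above. Only call_fn and the deterministic flag come from the description; the function names, the signature, the regex tokenizer, and the choice to treat adapter exceptions as rejections are assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, Optional


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens; a stand-in for the project's real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def refine_or_fallback(
    deterministic_text: str,
    source_notes: Iterable[str],
    call_fn: Optional[Callable[[str], str]] = None,
    deterministic: bool = False,
) -> str:
    """Return the model rewrite only if it is non-blank and fully grounded."""
    # The adapter is optional and switched off in deterministic mode.
    if deterministic or call_fn is None:
        return deterministic_text
    try:
        candidate = call_fn(deterministic_text).strip()
    except Exception:
        # Assumption: any adapter error fails closed to the deterministic text.
        return deterministic_text
    if not candidate:
        return deterministic_text
    allowed: set[str] = set()
    for note in source_notes:
        allowed |= _tokens(note)
    # Reject the completion if any of its tokens is absent from the cluster.
    if _tokens(candidate) - allowed:
        return deterministic_text
    return candidate
```

With source notes "User prefers dark mode" and "the user prefers dark mode in the editor", a completion of "User prefers light mode" is rejected because "light" never appears in the cluster, so the deterministic text is kept.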

Why

contextweaver already ingests and phase-selects memory, but memory only
accumulates. This closes the loop: agents get compounding, deduplicated
knowledge with auditable provenance. The engine is built as standalone
functions (mirroring handoff.py) rather than new ContextManager methods,
honoring the constraint in manager.py against growing its public surface.

How verified

Run in a venv (--system-site-packages):

  • ruff check src/ tests/ examples/ scripts/ → All checks passed;
    ruff format clean.
  • mypy src/ examples/ scripts/ → Success: no issues found in 242 source files.
  • pytest (full suite) — 2790 passed, 31 skipped, 1 xfailed. The one
    remaining failure, tests/test_mcp_serve_cli.py::test_serve_dry_run_writes_catalog_diagnostic_event,
    is a pre-existing, offline-only tiktoken artifact (403 fetching
    cl100k_base → fallback warning on stderr); it fails identically on a clean
    checkout and is expected to pass in CI with cached encodings.
  • make drift-check — all 7 generated artifacts up to date (incl. regenerated
    public-API manifest); make module-size-check, make doc-snippets-check,
    make readme-version-check — OK.
  • make demo and make example — exit 0.
  • New tests: pytest tests/test_consolidation.py tests/test_eval_consolidation.py
    → 31 passed; consolidate CLI tests in tests/test_cli.py → 3 passed.

Tradeoffs / scope notes (Mode B)

  • Decay is report-only. The acceptance criteria allow tombstone/status
    marking, but the store protocols are append-only with no update method;
    reporting decayed IDs (and letting callers act) respects that invariant
    rather than weakening it. Noted as a natural follow-up if an update path lands.
  • No top-level contextweaver.__init__ re-export. That module is a frozen
    grandfathered size ceiling (358 lines); growing it is disallowed. The public
    API is exposed via contextweaver.context / contextweaver.eval. Adding root
    parity with handoff can follow with a deliberate baseline bump if maintainers
    prefer it.
  • Sensitivity inheritance only ever raises a fact's label (max of sources) —
    consistent with the security-grade sensitivity model.
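
The sensitivity rule in the last bullet can be sketched as taking a maximum over ordered labels. The label names and their ordering below are hypothetical, as is the function name; the PR only states that a promoted fact inherits the max of its sources.

```python
from enum import IntEnum
from typing import Iterable


class Sensitivity(IntEnum):
    """Hypothetical ordered labels; the project's real levels may differ."""

    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def inherited_sensitivity(sources: Iterable[Sensitivity]) -> Sensitivity:
    """A promoted fact takes the highest label among its source episodes."""
    levels = list(sources)
    if not levels:
        raise ValueError("a promoted fact needs at least one source episode")
    # max() can only keep or raise the label, never lower it.
    return max(levels)
```

Because the result is a maximum, merging a PUBLIC episode with a CONFIDENTIAL one yields CONFIDENTIAL, and adding more sources can never downgrade the fact.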

Checklist

  • Tests added for new functionality (clustering determinism/idempotence,
    promotion thresholds, provenance, max-sensitivity, decay boundary,
    idempotent apply, LLM merge accept/reject/deterministic, eval, CLI)
  • make ci checks pass locally (see How verified; the one failing test is
    the unrelated offline-tiktoken artifact)
  • CHANGELOG.md updated under ## [Unreleased]
  • Google-style docstrings on all new public APIs
  • Public-API change → regenerated api/public_api.txt
  • Every module ≤300 lines (make module-size-check OK)
  • Agent-facing docs updated (AGENTS.md module map + Key Types)

🤖 Generated with Claude Code

https://claude.ai/code/session_01JiHvVpWeqApRTyQReS4QJX


, #683)

Distill episodic memory into durable, deduplicated, provenance-stamped facts.

- context/consolidation.py: deterministic cluster_episodes (#679),
  promote_clusters with provenance + max-sensitivity inheritance (#680),
  decay_episodes/decay_facts report-only over append-only stores (#681),
  and the consolidate() orchestrator (idempotent apply via content-addressed
  fact IDs). Private helpers in _consolidation_helpers.py keep it <=300 lines.
- context/_consolidation_merge.py: optional fail-closed call_fn canonicalizer
  that rejects ungrounded completions, deterministic fallback (#682).
- context/consolidation_types.py: ConsolidationPolicy / EpisodeCluster /
  PromotedFact / ConsolidationReport with to_dict/from_dict.
- eval/consolidation.py: evaluate_consolidation -> ConsolidationEvalReport
  (precision/coverage + dedup ratio), offline + deterministic (#683).
- CLI: `contextweaver consolidate` subcommand over JSON-serialised stores.
- Re-exports via contextweaver.context and contextweaver.eval; regenerated
  api/public_api.txt; AGENTS.md module map + Key Types; CHANGELOG.

Pure stdlib, no new dependency. Standalone functions mirror handoff (no new
ContextManager method surface). Tests: test_consolidation.py,
test_eval_consolidation.py, plus CLI tests in test_cli.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JiHvVpWeqApRTyQReS4QJX
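
A rough sketch of why content-addressed fact IDs make apply idempotent over an append-only store, as the commit message above describes. The hash choice, ID prefix, dict-based store, and function names are assumptions for illustration, not the PR's implementation.

```python
import hashlib


def fact_id(canonical_text: str) -> str:
    """Derive a stable ID from content (hash and prefix are assumptions)."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    return f"fact-{digest[:16]}"


def apply_facts(store: dict[str, str], canonical_texts: list[str]) -> int:
    """Append only facts whose ID is not already present; return count added."""
    added = 0
    for text in canonical_texts:
        fid = fact_id(text)
        # Same content gives the same ID, so a re-run finds it and skips it.
        if fid not in store:
            store[fid] = text
            added += 1
    return added
```

Running apply_facts twice with the same inputs adds facts on the first call and nothing on the second, without ever updating or deleting an existing entry.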
Copilot AI review requested due to automatic review settings June 17, 2026 10:05

Copilot AI left a comment

Pull request overview

This PR adds a memory consolidation engine to the Context Engine, enabling episodic memory to be clustered, promoted into durable facts (with provenance + max-sensitivity inheritance), and evaluated/operated via a new CLI and eval harness.

Changes:

  • Introduces a deterministic consolidation pipeline (cluster_episodes, promote_clusters, decay_*, consolidate) plus supporting datatypes.
  • Adds an offline, stdlib-only evaluation harness (evaluate_consolidation) and CLI subcommand (contextweaver consolidate).
  • Updates public API exports and documentation artifacts (AGENTS.md, CHANGELOG, public_api manifest) and adds comprehensive tests.
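
For orientation, the precision / coverage / dedup-ratio report mentioned for evaluate_consolidation could be computed along these lines. The metric definitions here are assumptions (in particular the dedup ratio, which the PR does not define), and EvalSketch and evaluate are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalSketch:
    """Hypothetical report shape; not the PR's ConsolidationEvalReport."""

    precision: Optional[float]
    coverage: Optional[float]
    dedup_ratio: float


def evaluate(
    promoted: list[str],
    episode_count: int,
    gold: Optional[set[str]] = None,
) -> EvalSketch:
    """Score promoted facts against an optional gold set, fully offline."""
    unique = set(promoted)
    # Assumed definition: share of episodes collapsed away by consolidation.
    dedup_ratio = 1.0 - len(unique) / episode_count if episode_count else 0.0
    if gold is None:
        return EvalSketch(None, None, dedup_ratio)
    hits = len(unique & gold)
    # Precision: promoted facts that are in gold. Coverage: gold facts found.
    precision = hits / len(unique) if unique else 0.0
    coverage = hits / len(gold) if gold else 0.0
    return EvalSketch(precision, coverage, dedup_ratio)
```

The function uses only set arithmetic on its inputs, so repeated runs over the same data give the same report.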

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/contextweaver/context/consolidation.py | New consolidation orchestrator + clustering/promotion/decay functions; optional apply-to-FactStore. |
| src/contextweaver/context/consolidation_types.py | New policy/report dataclasses with to_dict/from_dict and validation. |
| src/contextweaver/context/_consolidation_helpers.py | Deterministic helper utilities (canonical text, timestamps, sensitivity, IDs, decay predicate). |
| src/contextweaver/context/_consolidation_merge.py | Optional fail-closed call_fn merge/refinement with grounding guardrails. |
| src/contextweaver/eval/consolidation.py | New consolidation quality evaluation harness and report type. |
| src/contextweaver/eval/__init__.py | Re-export consolidation eval surface from contextweaver.eval. |
| src/contextweaver/context/__init__.py | Re-export consolidation surface from contextweaver.context. |
| src/contextweaver/__main__.py | Adds contextweaver consolidate CLI command. |
| tests/test_consolidation.py | Unit tests covering determinism, promotion thresholds, sensitivity inheritance, decay, apply/idempotence, and serialization. |
| tests/test_eval_consolidation.py | Unit tests for eval precision/coverage/dedup metrics and report round-trip. |
| tests/test_cli.py | CLI integration tests for consolidate JSON output, apply + facts-out, and bad input handling. |
| CHANGELOG.md | Documents the new consolidation feature set and public surface. |
| api/public_api.txt | Regenerated public API manifest including new consolidation/eval APIs. |
| AGENTS.md | Module map + key types updated to include consolidation engine and eval harness. |

Comment thread src/contextweaver/context/_consolidation_helpers.py
Comment thread src/contextweaver/context/_consolidation_helpers.py
Comment thread src/contextweaver/context/consolidation.py
Comment thread src/contextweaver/__main__.py Outdated
…iew)

Addresses Copilot review on #708:

- parse_iso: normalise the RFC 3339 `Z` suffix to `+00:00` (parses on Python
  3.10) and convert tz-aware values to naive UTC, mirroring the repo's ISO
  convention (RoutingDecision.from_dict). Single normalisation point feeding
  both seen_bounds and is_decayed.
- is_decayed: compare against a timedelta instead of floored whole days (so a
  timestamp older than the horizon by <24h still decays) and normalise as_of to
  naive UTC so a tz-aware as_of/stamp never raises TypeError.
- consolidate(apply=True): stamp the policy decay timestamp key (= last_seen) on
  promoted facts so they are themselves decay-eligible on later runs, not only
  their source episodes.
- CLI `consolidate`: parse --as-of via parse_iso (Z support) and default the
  decay reference to now so --decay-after-days takes effect without an explicit
  --as-of.

Tests: Z-suffix + tz-aware decay, sub-day decay granularity, seen-bounds string
preservation, consolidated-fact decay on a later run, and CLI default-decay /
Z --as-of cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JiHvVpWeqApRTyQReS4QJX
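
The parse_iso and is_decayed fixes described in the commit message above can be illustrated with the following sketch. The function names come from the commit message, but the signatures, the strict `>` comparison, and the bodies are assumptions rather than the PR's code.

```python
from datetime import datetime, timedelta, timezone


def parse_iso(value: str) -> datetime:
    """Parse an ISO 8601 / RFC 3339 string into a naive UTC datetime."""
    text = value.strip()
    if text.endswith(("Z", "z")):
        # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is not None:
        # Convert tz-aware values to UTC, then drop tzinfo (naive UTC).
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed


def is_decayed(stamp: str, as_of: datetime, decay_after_days: float) -> bool:
    """True when the stamp is older than the decay horizon at as_of."""
    if as_of.tzinfo is not None:
        # Normalise as_of too, so aware and naive values never mix.
        as_of = as_of.astimezone(timezone.utc).replace(tzinfo=None)
    # Compare full timedeltas rather than floored whole days.
    return as_of - parse_iso(stamp) > timedelta(days=decay_after_days)
```

With a 7-day horizon, a stamp that is 7 days and 6 hours old decays here, whereas flooring the age to whole days (7) and comparing against 7 would have kept it.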

github-actions Bot commented Jun 17, 2026

Benchmark delta (vs main)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.

Routing summary (single backend × catalog sizes)

| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
| --- | --- | --- | --- |
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.281 (base 0.759) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.453 (base 1.134) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 25.841 (base 41.711) |

Per-backend × per-size matrix

| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
| --- | --- | --- | --- | --- |
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 3.706 (base 8.140) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 18.878 (base 38.989) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 49.497 (base 111.716) |
| embedding_hashing | 100 | ✅ 0.5175 (+0.0000) | ✅ 0.4360 (+0.0000) | ✅ 4.301 (base 7.225) |
| embedding_hashing | 500 | ✅ 0.2700 (+0.0000) | ✅ 0.2674 (+0.0000) | ✅ 25.399 (base 44.182) |
| embedding_hashing | 1000 | ✅ 0.2000 (+0.0000) | ✅ 0.1931 (+0.0000) | ✅ 60.307 (base 98.277) |
| embedding_st | 100 | skipped (skipped: missing sentence-transformers) | | |
| embedding_st | 500 | skipped (skipped: missing sentence-transformers) | | |
| embedding_st | 1000 | skipped (skipped: missing sentence-transformers) | | |
| fuzzy | 100 | skipped (skipped: missing rapidfuzz) | | |
| fuzzy | 500 | skipped (skipped: missing rapidfuzz) | | |
| fuzzy | 1000 | skipped (skipped: missing rapidfuzz) | | |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 0.636 (base 1.102) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 5.530 (base 11.492) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 22.306 (base 50.755) |

Context pipeline (per scenario)

| scenario | tokens | dropped | dedup |
| --- | --- | --- | --- |
| large_catalog | 1480 (base 1514, Δ-34) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2500 (base 2548, Δ-48) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| mixed_payload | 488 (base 497, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 487 (base 496, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6590 (base 6651, Δ-61) | 11 (base 7, Δ+4) | 4 (base 4, Δ+0) |
| tiny_payload | 256 (base 267, Δ-11) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

claude added 3 commits June 17, 2026 19:41
The consolidate CLI defaulted the decay reference to datetime.now() (naive
local time), while parse_iso normalises episode/fact timestamps to naive
UTC. On a non-UTC host this skewed the decay horizon by the host's offset.
Use datetime.now(timezone.utc) so the reference matches the stored stamps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUfbnfYYJDQ5ei5quKAftg
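
The skew described in the commit message above can be reproduced without touching the system clock. This sketch simulates a host at UTC+2 (an arbitrary, assumed offset) and compares the naive local reading with the naive UTC reading of the same instant; to_naive_utc is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def to_naive_utc(moment: datetime) -> datetime:
    """Normalise an aware datetime to naive UTC; leave naive values alone."""
    if moment.tzinfo is None:
        return moment
    return moment.astimezone(timezone.utc).replace(tzinfo=None)


# Assumed host offset for the demonstration only.
host_tz = timezone(timedelta(hours=2))
instant = datetime(2026, 6, 17, 19, 41, tzinfo=timezone.utc)

# What a naive datetime.now() would read on that host at this instant.
naive_local = instant.astimezone(host_tz).replace(tzinfo=None)

# What datetime.now(timezone.utc) reads, normalised like the stored stamps.
naive_utc = to_naive_utc(instant)

# The decay horizon shifts by exactly the host's UTC offset.
skew = naive_local - naive_utc
print(skew)
```

On a UTC host the two readings coincide, which is why the problem only appears on machines with a non-zero offset.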
Resolve CHANGELOG.md conflict by keeping both Unreleased "Added" entries:
the memory consolidation engine (this PR) and the package-metadata drift
guard (#473, from main). All other files auto-merged cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JHED1e6xNDDqjgR8v6Cvcm
Audit follow-ups on the memory consolidation engine:

- F1 (security): the optional LLM merge guardrail grounded only on
  tokenize() output, which strips stop-words, so an injected negation
  ("is safe" -> "is not safe") shared the same content tokens and was
  accepted, silently inverting a durable fact. Add a negation-term check
  (_negations) that rejects any polarity word absent from the source
  notes; soften the docstring's grounding claim; add a regression test.
- F4: decay_facts now honours datetime timestamps (not just ISO strings)
  via the new coerce_iso helper; add a mixed-type decay test.
- F3: re-export parse_iso from the public consolidation module so the CLI
  no longer imports from the private _consolidation_helpers module.
- F2: document cluster_episodes' O(n*k) greedy, offline-batch complexity.

Keeps consolidation.py at the 300-line ceiling by moving timestamp
coercion into _consolidation_helpers. make ci green (the one failing test
is the pre-existing offline-tiktoken artifact).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JHED1e6xNDDqjgR8v6Cvcm
@dgenio dgenio merged commit 31327d3 into main Jun 18, 2026
9 checks passed
@dgenio dgenio deleted the claude/issue-triage-grouping-5pw848 branch June 18, 2026 13:04