feat(context): memory consolidation engine (#498, #679, #680, #681, #682, #683) #708
Merged
Distill episodic memory into durable, deduplicated, provenance-stamped facts.
- context/consolidation.py: deterministic cluster_episodes (#679), promote_clusters with provenance + max-sensitivity inheritance (#680), decay_episodes/decay_facts report-only over append-only stores (#681), and the consolidate() orchestrator (idempotent apply via content-addressed fact IDs). Private helpers in _consolidation_helpers.py keep it <=300 lines.
- context/_consolidation_merge.py: optional fail-closed call_fn canonicalizer that rejects ungrounded completions, deterministic fallback (#682).
- context/consolidation_types.py: ConsolidationPolicy / EpisodeCluster / PromotedFact / ConsolidationReport with to_dict/from_dict.
- eval/consolidation.py: evaluate_consolidation -> ConsolidationEvalReport (precision/coverage + dedup ratio), offline + deterministic (#683).
- CLI: `contextweaver consolidate` subcommand over JSON-serialised stores.
- Re-exports via contextweaver.context and contextweaver.eval; regenerated api/public_api.txt; AGENTS.md module map + Key Types; CHANGELOG.
Pure stdlib, no new dependency. Standalone functions mirror handoff (no new ContextManager method surface).
Tests: test_consolidation.py, test_eval_consolidation.py, plus CLI tests in test_cli.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JiHvVpWeqApRTyQReS4QJX
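The commit message credits idempotent apply to content-addressed fact IDs. A minimal sketch of that idea follows, assuming a SHA-256 digest over whitespace- and case-normalised text and a plain dict as the store. The names `fact_id` and `apply_facts`, the `fact-` prefix, and the normalisation rule are illustrative assumptions, not the PR's actual helpers.

```python
import hashlib


def fact_id(canonical_text: str) -> str:
    """Derive a stable ID from the fact text (hypothetical scheme)."""
    # Assumption: lower-casing and collapsing whitespace defines "same content".
    normalised = " ".join(canonical_text.lower().split())
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return f"fact-{digest[:16]}"


def apply_facts(store: dict[str, str], texts: list[str]) -> int:
    """Upsert facts keyed by content ID; return how many keys were new."""
    added = 0
    for text in texts:
        key = fact_id(text)
        if key not in store:
            added += 1
        # Same content maps to the same key, so a repeat overwrites in place.
        store[key] = text
    return added


store: dict[str, str] = {}
first_run = apply_facts(store, ["User prefers dark mode", "user  prefers DARK mode"])
second_run = apply_facts(store, ["User prefers dark mode"])
```

Because the key depends only on content, the second run adds no new entries, which is the sense in which re-running over an unchanged store is a no-op.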
Pull request overview
This PR adds a memory consolidation engine to the Context Engine, enabling episodic memory to be clustered, promoted into durable facts (with provenance + max-sensitivity inheritance), and evaluated/operated via a new CLI and eval harness.
Changes:
- Introduces a deterministic consolidation pipeline (cluster_episodes, promote_clusters, decay_*, consolidate) plus supporting datatypes.
- Adds an offline, stdlib-only evaluation harness (evaluate_consolidation) and CLI subcommand (contextweaver consolidate).
- Updates public API exports and documentation artifacts (AGENTS.md, CHANGELOG, public_api manifest) and adds comprehensive tests.
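The PR description characterises cluster_episodes as stable Jaccard clustering with sorted-ID iteration and ties broken by ID, and a later audit commit calls it greedy with O(n*k) cost. The following is a rough sketch of that general technique only. The function names, the whitespace tokeniser, the seed-based comparison, and the 0.5 threshold are all assumptions made for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets score 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_jaccard(episodes: dict[str, str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy single pass over episode IDs in sorted order (n episodes x k clusters)."""
    clusters: list[tuple[set[str], list[str]]] = []  # (seed tokens, member IDs)
    for episode_id in sorted(episodes):  # sorted IDs make the result order-independent
        tokens = set(episodes[episode_id].lower().split())
        best_index, best_score = -1, 0.0
        for index, (seed_tokens, _) in enumerate(clusters):
            score = jaccard(tokens, seed_tokens)
            # Strict ">" keeps the earliest cluster on a tie, i.e. the lowest seed ID.
            if score >= threshold and score > best_score:
                best_index, best_score = index, score
        if best_index >= 0:
            clusters[best_index][1].append(episode_id)
        else:
            clusters.append((tokens, [episode_id]))
    return [members for _, members in clusters]


episodes = {
    "e3": "deploy uses blue green rollout",
    "e1": "deploy uses blue green rollout strategy",
    "e2": "user prefers dark mode",
}
clusters = cluster_by_jaccard(episodes)
```

Here `e1` and `e3` share five of six distinct tokens, so they land in one cluster, while `e2` seeds its own. Feeding the same episodes in a different insertion order yields the same clusters because iteration is over sorted IDs.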
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/contextweaver/context/consolidation.py | New consolidation orchestrator + clustering/promotion/decay functions; optional apply-to-FactStore. |
| src/contextweaver/context/consolidation_types.py | New policy/report dataclasses with to_dict/from_dict and validation. |
| src/contextweaver/context/_consolidation_helpers.py | Deterministic helper utilities (canonical text, timestamps, sensitivity, IDs, decay predicate). |
| src/contextweaver/context/_consolidation_merge.py | Optional fail-closed call_fn merge/refinement with grounding guardrails. |
| src/contextweaver/eval/consolidation.py | New consolidation quality evaluation harness and report type. |
| src/contextweaver/eval/__init__.py | Re-export consolidation eval surface from contextweaver.eval. |
| src/contextweaver/context/__init__.py | Re-export consolidation surface from contextweaver.context. |
| src/contextweaver/__main__.py | Adds contextweaver consolidate CLI command. |
| tests/test_consolidation.py | Unit tests covering determinism, promotion thresholds, sensitivity inheritance, decay, apply/idempotence, and serialization. |
| tests/test_eval_consolidation.py | Unit tests for eval precision/coverage/dedup metrics and report round-trip. |
| tests/test_cli.py | CLI integration tests for consolidate JSON output, apply + facts-out, and bad input handling. |
| CHANGELOG.md | Documents the new consolidation feature set and public surface. |
| api/public_api.txt | Regenerated public API manifest including new consolidation/eval APIs. |
| AGENTS.md | Module map + key types updated to include consolidation engine and eval harness. |
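The tests listed above cover promotion thresholds and sensitivity inheritance. A minimal sketch of that promotion rule might look like the following, assuming a cluster is promoted only when it meets both min_occurrences and min_sessions, and that the fact inherits the highest sensitivity among its sources. The `Episode` and `Fact` dataclasses, the sensitivity labels, and the choice of the lowest-ID episode's text are hypothetical stand-ins, not the PR's PromotedFact type.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sensitivity labels, ordered lowest to highest.
SENSITIVITY_ORDER = ["public", "internal", "confidential", "restricted"]


@dataclass(frozen=True)
class Episode:
    id: str
    session_id: str
    text: str
    sensitivity: str


@dataclass(frozen=True)
class Fact:
    text: str
    source_ids: tuple[str, ...]  # provenance: every contributing episode
    sensitivity: str


def promote(cluster: list[Episode], min_occurrences: int = 2, min_sessions: int = 2) -> Optional[Fact]:
    """Promote a cluster only if it recurs often enough, across enough sessions."""
    sessions = {episode.session_id for episode in cluster}
    if len(cluster) < min_occurrences or len(sessions) < min_sessions:
        return None
    ordered = sorted(cluster, key=lambda episode: episode.id)
    # Inherit the maximum sensitivity; an unknown label raises ValueError here.
    highest = max((e.sensitivity for e in ordered), key=SENSITIVITY_ORDER.index)
    return Fact(
        text=ordered[0].text,  # assumption: lowest-ID episode supplies the text
        source_ids=tuple(e.id for e in ordered),
        sensitivity=highest,
    )


cluster = [
    Episode("e2", "s2", "db is postgres", "confidential"),
    Episode("e1", "s1", "db is postgres", "public"),
]
fact = promote(cluster)
```

Taking the maximum means a fact built from one confidential and one public episode is itself confidential, so sensitivity can only be inherited upward.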
…iew)
Addresses Copilot review on #708:
- parse_iso: normalise the RFC 3339 `Z` suffix to `+00:00` (parses on Python 3.10) and convert tz-aware values to naive UTC, mirroring the repo's ISO convention (RoutingDecision.from_dict). Single normalisation point feeding both seen_bounds and is_decayed.
- is_decayed: compare against a timedelta instead of floored whole days (so a timestamp older than the horizon by <24h still decays) and normalise as_of to naive UTC so a tz-aware as_of/stamp never raises TypeError.
- consolidate(apply=True): stamp the policy decay timestamp key (= last_seen) on promoted facts so they are themselves decay-eligible on later runs, not only their source episodes.
- CLI `consolidate`: parse --as-of via parse_iso (Z support) and default the decay reference to now so --decay-after-days takes effect without an explicit --as-of.
Tests: Z-suffix + tz-aware decay, sub-day decay granularity, seen-bounds string preservation, consolidated-fact decay on a later run, and CLI default-decay / Z --as-of cases.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JiHvVpWeqApRTyQReS4QJX
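The two timestamp fixes described above can be illustrated with a short sketch. It assumes only the behaviour stated in the commit message: a trailing `Z` is rewritten to `+00:00`, tz-aware values become naive UTC, and decay compares against a timedelta rather than floored days. The names `to_naive_utc` and `older_than_horizon` and their signatures are hypothetical stand-ins for parse_iso and is_decayed.

```python
from datetime import datetime, timedelta, timezone


def to_naive_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 / RFC 3339 timestamp into a naive UTC datetime."""
    if stamp.endswith("Z"):
        # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on.
        stamp = stamp[:-1] + "+00:00"
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is not None:
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed


def older_than_horizon(stamp: str, as_of: datetime, decay_after_days: float) -> bool:
    """True when the stamp is older than the decay horizon, at sub-day precision."""
    if as_of.tzinfo is not None:
        # Normalise so aware and naive values are never subtracted from each other.
        as_of = as_of.astimezone(timezone.utc).replace(tzinfo=None)
    return as_of - to_naive_utc(stamp) > timedelta(days=decay_after_days)


stamp = "2024-01-01T00:00:00Z"
half_day_past = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)  # 30 days + 12 h later
exactly_at = datetime(2024, 1, 31)                                 # exactly 30 days later
```

With a 30-day horizon, `half_day_past` counts as decayed even though the floored day difference is still 30, which is the sub-day case the commit message calls out. `exactly_at` does not decay because the comparison is strict.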
Benchmark delta (vs
| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
|---|---|---|---|
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.281 (base 0.759) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.453 (base 1.134) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 25.841 (base 41.711) |
Per-backend × per-size matrix
| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
|---|---|---|---|---|
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 3.706 (base 8.140) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 18.878 (base 38.989) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 49.497 (base 111.716) |
| embedding_hashing | 100 | ✅ 0.5175 (+0.0000) | ✅ 0.4360 (+0.0000) | ✅ 4.301 (base 7.225) |
| embedding_hashing | 500 | ✅ 0.2700 (+0.0000) | ✅ 0.2674 (+0.0000) | ✅ 25.399 (base 44.182) |
| embedding_hashing | 1000 | ✅ 0.2000 (+0.0000) | ✅ 0.1931 (+0.0000) | ✅ 60.307 (base 98.277) |
| embedding_st | 100 | skipped (skipped: missing sentence-transformers) | — | — |
| embedding_st | 500 | skipped (skipped: missing sentence-transformers) | — | — |
| embedding_st | 1000 | skipped (skipped: missing sentence-transformers) | — | — |
| fuzzy | 100 | skipped (skipped: missing rapidfuzz) | — | — |
| fuzzy | 500 | skipped (skipped: missing rapidfuzz) | — | — |
| fuzzy | 1000 | skipped (skipped: missing rapidfuzz) | — | — |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 0.636 (base 1.102) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 5.530 (base 11.492) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 22.306 (base 50.755) |
Context pipeline (per scenario)
| scenario | tokens | dropped | dedup |
|---|---|---|---|
| large_catalog | 1480 (base 1514, Δ-34) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2500 (base 2548, Δ-48) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| mixed_payload | 488 (base 497, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 487 (base 496, Δ-9) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6590 (base 6651, Δ-61) | 11 (base 7, Δ+4) | 4 (base 4, Δ+0) |
| tiny_payload | 256 (base 267, Δ-11) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.
The consolidate CLI defaulted the decay reference to datetime.now() (naive local time), while parse_iso normalises episode/fact timestamps to naive UTC. On a non-UTC host this skewed the decay horizon by the host's offset. Use datetime.now(timezone.utc) so the reference matches the stored stamps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUfbnfYYJDQ5ei5quKAftg
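To make the skew concrete, here is a small sketch of the two reference clocks. It assumes the stored stamps are naive UTC, as the commit message says, and strips tzinfo from the UTC "now" so the two can be subtracted. The helper names `utc_reference` and `local_skew` are illustrative, not taken from the CLI.

```python
from datetime import datetime, timedelta, timezone


def utc_reference() -> datetime:
    """Current time as a naive UTC datetime, comparable with naive UTC stamps."""
    return datetime.now(timezone.utc).replace(tzinfo=None)


def local_skew() -> timedelta:
    """Offset by which a naive local datetime.now() differs from naive UTC."""
    # astimezone() on a naive value treats it as local time and attaches the host zone.
    return datetime.now().astimezone().utcoffset() or timedelta(0)


reference = utc_reference()
skew = local_skew()
```

On a host set to UTC the skew is zero and both clocks agree. On any other host a naive `datetime.now()` would shift the decay horizon by `skew`.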
Resolve CHANGELOG.md conflict by keeping both Unreleased "Added" entries: the memory consolidation engine (this PR) and the package-metadata drift guard (#473, from main). All other files auto-merged cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JHED1e6xNDDqjgR8v6Cvcm
Audit follow-ups on the memory consolidation engine:
- F1 (security): the optional LLM merge guardrail grounded only on
tokenize() output, which strips stop-words, so an injected negation
("is safe" -> "is not safe") shared the same content tokens and was
accepted, silently inverting a durable fact. Add a negation-term check
(_negations) that rejects any polarity word absent from the source
notes; soften the docstring's grounding claim; add a regression test.
- F4: decay_facts now honours datetime timestamps (not just ISO strings)
via the new coerce_iso helper; add a mixed-type decay test.
- F3: re-export parse_iso from the public consolidation module so the CLI
no longer imports from the private _consolidation_helpers module.
- F2: document cluster_episodes' O(n*k) greedy, offline-batch complexity.
Keeps consolidation.py at the 300-line ceiling by moving timestamp
coercion into _consolidation_helpers. make ci green (the one failing test
is the pre-existing offline-tiktoken artifact).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JHED1e6xNDDqjgR8v6Cvcm
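The F1 finding is easier to see with a toy version of the guardrail. The sketch below assumes a grounding check on stop-word-stripped tokens plus a separate negation check on the raw words. The word lists, the regex tokeniser, and the function names are hypothetical, since the PR's actual tokenize() and _negations are not shown here.

```python
import re

# Hypothetical word lists for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "not", "no", "never", "to", "of"}
NEGATIONS = {"not", "no", "never", "cannot", "without"}


def words(text: str) -> list[str]:
    """Lower-cased word tokens from the raw text."""
    return re.findall(r"[a-z0-9']+", text.lower())


def content_tokens(text: str) -> set[str]:
    """Content-bearing tokens only; stop-words (including 'not') are stripped."""
    return {word for word in words(text) if word not in STOP_WORDS}


def negations(text: str) -> set[str]:
    """Polarity words, read from the raw words before stop-word stripping."""
    return NEGATIONS.intersection(words(text))


def is_grounded(completion: str, source_notes: list[str]) -> bool:
    """Accept a completion only if it adds no content token and no negation term."""
    source = " ".join(source_notes)
    tokens = content_tokens(completion)
    if not tokens:
        return False  # blank or stop-word-only completion
    if not tokens <= content_tokens(source):
        return False  # introduces a token absent from the source notes
    return negations(completion) <= negations(source)


notes = ["The deploy script is safe"]
token_only_check = content_tokens("deploy script is not safe") <= content_tokens(notes[0])
```

`token_only_check` is true, which reproduces the gap the audit describes: the inverted sentence shares every content token with the source. The added negation comparison is what rejects it.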
Summary
Implements the memory consolidation engine epic as one coherent PR: distill
episodic memory into durable, deduplicated, provenance-stamped facts. The epic
(#498) and its five children are delivered together because they share one
module surface, one set of data types, and one implementation path.
Closes #498. Closes #679. Closes #680. Closes #681. Closes #682. Closes #683.
What changed
- context/consolidation.py: deterministic engine.
  - cluster_episodes (Deterministic episodic clustering and dedupe engine #679): stable, idempotent Jaccard clustering of episodes (sorted-ID iteration, ties break by ID).
  - promote_clusters (Fact promotion rules with provenance links #680): promotes clusters meeting min_occurrences / min_sessions into PromotedFact records with full source provenance and the maximum source sensitivity (inherited up, never down).
  - decay_episodes / decay_facts (Memory decay and expiry policy mechanics #681): report entries past the decay horizon without deleting them; the stores are append-only.
  - consolidate(...) orchestrator; apply=True upserts content-addressed facts, so re-running over an unchanged store is a no-op (idempotent).
- context/_consolidation_helpers.py: private deterministic helpers (canonical text, max-sensitivity, session counting, ISO parsing, content fact-IDs, decay predicate); keeps consolidation.py ≤300 lines.
- context/_consolidation_merge.py (Optional model-assisted consolidation adapter #682): optional, fail-closed call_fn canonicalizer that rejects blank / ungrounded completions (any token absent from the source cluster) and falls back to the deterministic text. No LLM SDK dep; disabled under deterministic=True.
- context/consolidation_types.py: ConsolidationPolicy, EpisodeCluster, PromotedFact, ConsolidationReport with to_dict / from_dict.
- eval/consolidation.py (Consolidation quality evaluation harness #683): evaluate_consolidation → ConsolidationEvalReport (precision / coverage vs an optional gold set + dedup ratio); offline, deterministic.
- CLI: contextweaver consolidate runs the pipeline over JSON-serialised stores (--apply / --facts-out / decay + threshold flags).
- Re-exports via contextweaver.context and contextweaver.eval; regenerated api/public_api.txt; AGENTS.md module map + Key Types; CHANGELOG.md.
Why
contextweaver already ingests and phase-selects memory, but memory only accumulates. This closes the loop: agents get compounding, deduplicated knowledge with auditable provenance. The engine is built as standalone functions (mirroring handoff.py) rather than new ContextManager methods, honoring the constraint in manager.py against growing its public surface.
How verified
Run in a venv (--system-site-packages):
- ruff check src/ tests/ examples/ scripts/: All checks passed; ruff format clean.
- mypy src/ examples/ scripts/: Success: no issues found in 242 source files.
- pytest (full suite): 2790 passed, 31 skipped, 1 xfailed. The one remaining failure, tests/test_mcp_serve_cli.py::test_serve_dry_run_writes_catalog_diagnostic_event, is a pre-existing, offline-only tiktoken artifact (403 fetching cl100k_base → fallback warning on stderr); it fails identically on a clean checkout and is expected to pass in CI with cached encodings.
- make drift-check: all 7 generated artifacts up to date (incl. regenerated public-API manifest); make module-size-check, make doc-snippets-check, make readme-version-check: OK.
- make demo and make example: exit 0.
- pytest tests/test_consolidation.py tests/test_eval_consolidation.py: 31 passed; consolidate CLI tests in tests/test_cli.py: 3 passed.
Tradeoffs / scope notes (Mode B)
- marking, but the store protocols are append-only with no update method; reporting decayed IDs (and letting callers act) respects that invariant rather than weakening it. Noted as a natural follow-up if an update path lands.
- contextweaver.__init__ re-export. That module is a frozen grandfathered size ceiling (358 lines); growing it is disallowed. The public API is exposed via contextweaver.context / contextweaver.eval. Adding root parity with handoff can follow with a deliberate baseline bump if maintainers prefer it.
- consistent with the security-grade sensitivity model.
Checklist
- promotion thresholds, provenance, max-sensitivity, decay boundary, idempotent apply, LLM merge accept/reject/deterministic, eval, CLI)
- make ci checks pass locally (see How verified; the one failing test is the unrelated offline-tiktoken artifact)
- CHANGELOG.md updated under ## [Unreleased]
- api/public_api.txt
- make module-size-check OK)
- AGENTS.md module map + Key Types)
🤖 Generated with Claude Code
https://claude.ai/code/session_01JiHvVpWeqApRTyQReS4QJX
Generated by Claude Code