feat(consolidate): batch-propose supersedes for near-duplicate approved claims #336
dripsmvcp wants to merge 1 commit into
…ed claims retroactive consolidation pass: clusters same-kind approved claims by embedding cosine similarity, picks a deterministic survivor per cluster (highest confidence, then most recent, then lexicographic id), and emits supersede or merge intents through the review gate. the pass never mutates durable claims directly — it only proposes. registered at all four surfaces (mcp, jsonl, cli, capabilities). tests cover config loading, survivor selection, clustering, proposal dedup, exclusion of archived/superseded claims, and the approve flow.
Pull request overview
This PR implements kb.consolidate (closing #308), a retroactive cleanup pass that clusters near-duplicate already-approved claims by embedding cosine similarity, nominates a deterministic survivor per cluster (confidence → updated_at → id), and files supersede-relation or merge-claim proposals into the pending queue for human review. It is registered across the four standard kb.* surfaces (MCP, JSONL, capabilities, CLI) and is config-driven via consolidate.* keys. The pass is review-gated — it only proposes and never mutates durable claims directly.
Changes:
- Adds `src/vouch/consolidate.py` with clustering (union-find over pairwise cosine), survivor selection, defensive config loading, and supersede/merge proposal filing.
- Registers `kb.consolidate` at all four surfaces and documents it in `CHANGELOG.md`.
- Adds unit tests (config/survivor/registration) and embedding-dependent clustering/proposal/approve-flow tests.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/vouch/consolidate.py | New consolidation module: clustering, survivor selection, config, supersede/merge proposals |
| src/vouch/server.py | MCP kb_consolidate tool wrapper |
| src/vouch/jsonl_server.py | JSONL _h_consolidate handler + HANDLERS entry |
| src/vouch/capabilities.py | Adds kb.consolidate to METHODS |
| src/vouch/cli.py | vouch consolidate CLI mirror with JSON/human output |
| CHANGELOG.md | Documents the new feature under [Unreleased] |
| tests/test_consolidate.py | Config, survivor-selection, and no-embedding tests |
| tests/embeddings/test_consolidate.py | Clustering, dry-run, supersede/merge, exclusion, dedup, approve-flow tests |
Key point for reviewers: the module's docstrings and issue #308's acceptance criteria state that approving a consolidation proposal invokes lifecycle.supersede, but the current approve() path only persists the Relation/claim artifact and never supersedes the source claims — leaving them re-eligible on later passes. This behavioral gap warrants a human design decision.

    In supersede mode: for each non-survivor, proposes a relation that on
    approval calls lifecycle.supersede(old=member, new=survivor).

    In merge mode: proposes a single new claim per cluster that unions the
    evidence/entities/tags, then supersedes every member on approval.
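
The two modes described in this docstring differ mainly in how many proposals a cluster produces. A minimal sketch of that difference, assuming a hypothetical `Claim` dataclass and plain-dict intents (neither is taken from `src/vouch/consolidate.py`, and nothing here calls the real `lifecycle.supersede` or `propose_claim`):

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical stand-in for an approved claim; not the vouch model."""
    id: str
    evidence: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


def build_intents(cluster: list[Claim], survivor: Claim, mode: str) -> list[dict]:
    """Return review-gated intents for one cluster; nothing is mutated."""
    members = sorted(cluster, key=lambda c: c.id)
    if mode == "supersede":
        # One proposal per non-survivor: old=member, new=survivor.
        return [
            {"intent": "supersede", "old": c.id, "new": survivor.id}
            for c in members
            if c.id != survivor.id
        ]
    if mode == "merge":
        # One proposal per cluster: a new claim carrying the union of fields.
        def union(attr: str) -> list[str]:
            return sorted({v for c in members for v in getattr(c, attr)})

        return [{
            "intent": "merge",
            "supersedes": [c.id for c in members],
            "evidence": union("evidence"),
            "entities": union("entities"),
            "tags": union("tags"),
        }]
    raise ValueError(f"unknown mode: {mode!r}")
```

Under these assumptions, a cluster of n claims yields n - 1 intents in supersede mode and exactly one in merge mode.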

    claims.sort(
        key=lambda c: (-c.confidence, c.updated_at.isoformat(), c.id),
        reverse=False,
    )
    # Highest confidence first: negate so ascending sort picks it.
    # Among ties: most recent updated_at (reverse chrono → want latest
    # first, so sort descending on timestamp string).
    # Final tiebreak: smallest id (ascending).
    best = claims[0]
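
Read literally, the key above sorts `updated_at.isoformat()` ascending, so among claims tied on confidence the oldest one lands at index 0, whereas the comments and the PR description ask for the most recent. A minimal sketch of a key that follows the stated order (highest confidence, then most recent, then smallest id), assuming a hypothetical `Claim` dataclass with timezone-aware `updated_at` values:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Claim:
    """Hypothetical stand-in; only the three fields used for ranking."""
    id: str
    confidence: float
    updated_at: datetime


def pick_survivor(claims: list[Claim]) -> Claim:
    """Highest confidence, then most recent updated_at, then smallest id."""
    if not claims:
        raise ValueError("cluster must not be empty")
    return min(
        claims,
        # Negate the numeric fields so that "larger is better" sorts first;
        # id stays ascending as the final tiebreak.
        key=lambda c: (-c.confidence, -c.updated_at.timestamp(), c.id),
    )


older = Claim("a", 0.9, datetime(2024, 1, 1, tzinfo=timezone.utc))
newer = Claim("b", 0.9, datetime(2024, 6, 1, tzinfo=timezone.utc))
print(pick_survivor([older, newer]).id)  # b
```

Comparing numeric timestamps also avoids relying on string comparison of ISO timestamps, which only orders chronologically when every value uses the same UTC offset.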

    # Pairwise cosine (O(n²) like scan_all).
    pairs: list[tuple[str, str, float]] = []
    keys = sorted(vecs.keys())
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:
            if vecs[k1].shape != vecs[k2].shape:
                continue
            cos = float(vecs[k1] @ vecs[k2])
            if cos >= threshold:
                pairs.append((k1, k2, cos))
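
The loop above treats `vecs[k1] @ vecs[k2]` as the cosine, which holds only if the vectors are unit-length; the excerpt does not show whether they are normalized upstream. A minimal sketch of the pairwise scan followed by the union-find grouping mentioned in the overview, assuming plain Python lists instead of NumPy arrays and a hypothetical `cluster_by_cosine` helper that normalizes explicitly:

```python
import math


def cluster_by_cosine(vecs: dict[str, list[float]], threshold: float) -> list[list[str]]:
    """Group keys whose pairwise cosine is >= threshold (transitively)."""
    # Normalize once so the dot product below is a true cosine.
    unit: dict[str, list[float]] = {}
    for key, vec in vecs.items():
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0.0:  # zero vectors have no direction; skip them
            unit[key] = [x / norm for x in vec]

    parent = {key: key for key in unit}

    def find(key: str) -> str:
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    keys = sorted(unit)
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:
            if len(unit[k1]) != len(unit[k2]):
                continue  # mismatched dimensions, as in the excerpt
            cos = sum(a * b for a, b in zip(unit[k1], unit[k2]))
            if cos >= threshold:
                r1, r2 = find(k1), find(k2)
                if r1 != r2:
                    # Keep the smaller key as root so the structure is deterministic.
                    parent[max(r1, r2)] = min(r1, r2)

    groups: dict[str, list[str]] = {}
    for key in keys:
        groups.setdefault(find(key), []).append(key)
    # Singletons are not duplicates of anything, so drop them.
    return sorted(g for g in groups.values() if len(g) > 1)
```

Because union-find links transitively, two claims can end up in the same cluster even when their own cosine is below the threshold, as long as a chain of above-threshold pairs connects them.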

    result = cons.consolidate(store, threshold=0.99, max_clusters=0, dry_run=True)
    # max_clusters=0 is invalid, falls back to default (50).
    # But let's test with max_clusters=1; should return at most 1.
    result = cons.consolidate(store, threshold=0.99, max_clusters=1, dry_run=True)
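
In the test above the first `result` is overwritten before anything is asserted on it, so the fallback described in the comment is not actually exercised. A minimal sketch of the fallback itself, assuming a hypothetical `resolve_max_clusters` helper (the real validation inside `cons.consolidate` is not shown in the diff) and the default of 50 quoted in the comment:

```python
DEFAULT_MAX_CLUSTERS = 50  # default quoted in the test comment


def resolve_max_clusters(value: object, default: int = DEFAULT_MAX_CLUSTERS) -> int:
    """Return value if it is a positive int, otherwise the default.

    Hypothetical helper: bools, non-ints, zero, and negatives all count
    as invalid and fall back to the default.
    """
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        return default
    return value
```

A test along these lines could assert on each call separately, for example that `max_clusters=0` behaves like the default and that `max_clusters=1` returns at most one cluster.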
summary
retroactive consolidation pass for near-duplicate approved claims. clusters same-kind claims by embedding cosine similarity (reuses the `dedup.scan_all` vector machinery), picks a deterministic survivor per cluster (highest confidence → most recent → lexicographic id), and emits supersede or merge intents into the pending queue for human review.

- `kb.consolidate` registered at all four surfaces (mcp, jsonl, cli, capabilities)
- `--mode=supersede` (default): files one supersede relation proposal per non-survivor
- `--mode=merge`: proposes a single union claim per cluster via `propose_claim`
- `--dry-run`: reports clusters without writing anything
- `--max-clusters`: bounds a single pass
- config keys: `consolidate.threshold`, `consolidate.mode`, `consolidate.max_clusters`

test plan
closes #308