distill: byte-deterministic consensus pipeline #35
Replace the LLM-writes-spec procedure with a four-stage pipeline that
produces a byte-deterministic spec. The skill spawns K subagents in
parallel via the Agent tool, each producing a structured JSON inventory
of the codebase; the skill then canonicalises each, merges them into a
consensus, and translates that to the .allium spec via three
pure-function Node scripts shipped with the plugin.
Architecture:
  K subagents (Agent tool)
    |  inventory-i.json (per subagent, follows references/inventory-schema.md)
    v
  scripts/canonicalize-inventory.mjs   # convention normalisation
    v
  scripts/merge-inventories.mjs        # K-way majority/modal consensus
    v
  scripts/inventory-to-spec.mjs        # pure-function translator
    v
  allium check                         # validation
The three scripts are pure functions: same K canonical inventories always
produce a byte-identical merged inventory, which always produces a
byte-identical .allium spec. The only non-determinism is the LLM stage
producing the K raw inventories; consensus voting washes most of it out.
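To illustrate what byte-determinism asks of the script stages, here is a minimal sketch of order-independent JSON serialisation. `stableStringify` and the sample inventories are hypothetical and are not taken from `canonicalize-inventory.mjs`; the sketch assumes plain JSON-parsed input (no `undefined`, functions or cycles).

```javascript
// Hypothetical helper, not the plugin's code: serialise a JSON value with
// object keys sorted recursively, so key insertion order cannot change the bytes.
function stableStringify(value) {
  if (Array.isArray(value)) {
    // Array order is preserved; only object key order is normalised.
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const body = keys.map((k) => JSON.stringify(k) + ":" + stableStringify(value[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Two made-up inventories with the same content but different key order.
const a = { entities: [{ name: "Claim", fields: ["id", "status"] }], rules: [] };
const b = { rules: [], entities: [{ fields: ["id", "status"], name: "Claim" }] };

console.log(stableStringify(a) === stableStringify(b)); // true
console.log(JSON.stringify(a) === JSON.stringify(b)); // false
```

Plain `JSON.stringify` follows insertion order, which is why the second comparison fails. A script that emits its output through something like `stableStringify` produces the same bytes for the same logical input.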
Files:
- skills/distill/SKILL.md rewritten as orchestrator (149 lines)
- skills/distill/references/inventory-schema.md new (232 lines)
- scripts/canonicalize-inventory.mjs new (230 lines)
- scripts/merge-inventories.mjs new (201 lines)
- scripts/inventory-to-spec.mjs new (553 lines)
Measured on a fresh ~500 LOC insurance-claims Python fixture, comparing
v3.3.0 (LLM writes the spec) against this pipeline:
metric                       v3.3.0 baseline  this PR
----------------------------+----------------+-------------------
allium check                 6/6 pass         6/6 pass
pairwise spec diff median    487 lines        byte-identical*
entity-name Jaccard          0.75             1.00
rule-name Jaccard            0.28             1.00
9-pattern fixture coverage   ~6/9             9/9
*for a fixed set of K canonical inventories. Across LLM rolls the K
inventories vary slightly; consensus voting washes most of it out and
the resulting specs are structurally identical with small body-only
diffs.
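The consensus-voting idea can be sketched as a strict-majority vote over names. This is an assumption-laden illustration, not the logic of `merge-inventories.mjs`: the `majorityNames` function, the inventory shape (`{ entities: [{ name }] }`) and the sample data are all hypothetical.

```javascript
// Hypothetical sketch: keep a name only when a strict majority of the K
// inventories contain it, and return the survivors in sorted order.
function majorityNames(inventories, section) {
  const counts = new Map();
  for (const inv of inventories) {
    // A Set stops one inventory from voting twice for the same name.
    const names = new Set((inv[section] ?? []).map((item) => item.name));
    for (const name of names) {
      counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  const threshold = inventories.length / 2;
  return [...counts.entries()]
    .filter(([, votes]) => votes > threshold)
    .map(([name]) => name)
    .sort(); // sorting makes the result independent of input order
}

// K = 3 made-up samples; only one of them reports "AuditLog".
const samples = [
  { entities: [{ name: "Claim" }, { name: "Policy" }] },
  { entities: [{ name: "Claim" }, { name: "Policy" }, { name: "AuditLog" }] },
  { entities: [{ name: "Policy" }, { name: "Claim" }] },
];

console.log(majorityNames(samples, "entities")); // [ 'Claim', 'Policy' ]
```

With K=3 a name needs at least two votes, so a construct that only one subagent reports is dropped from the consensus.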
Cost: K x inference per invocation. Default K=3, configurable in the
orchestrator. Wall time dominated by the slowest of K parallel
subagents (~5-10 min on Opus for the fixture).
Caveats:
- Requires Node available on PATH (stdlib only, no extra deps).
- Subagent prompt explicitly forbids invoking the distill skill itself
(would recurse); uses general-purpose subagents reading the schema.
- The 9/9 vs ~6/9 coverage gain is partly because the explicit schema
forces the LLM to consider sections it might otherwise skip
(value types, auxiliary enums, contracts as separate constructs from
external entities, top-level invariants).
Flipping to draft while I rework the schema doc. Quick note on what surfaced during follow-up testing: this PR was iterated on a single Python fixture (insurance-claims; 6/6 pass, 9/9 patterns, 0 errors). I subsequently authored a second Python fixture (a sell-side trading-risk desk, ~1,370 LOC) and ran the pipeline on it.
The architecture (subagents → canonicalise → merge → translate) generalised: structural fidelity was good (7+2 entities, 3 contracts, 2 surfaces, 18 rules; all expected blocks present), and per-run reproducibility held. But the distilled spec failed
That's the wrong direction for a language-neutral spec language: it doesn't scale (every new source language would need its own table), and it puts the wrong burden on the schema doc.
Reworking the schema in a positive, language-neutral framing: describing what Allium accepts (expression grammar, type catalogue, construct catalogue) rather than what specific source languages don't have. A translation principle replaces the per-language anti-patterns: "the inventory contains only Allium constructs; if the source code uses an idiom with no direct Allium equivalent, model the behaviour in Allium, not the syntax", e.g. a conditional value becomes a
Plan before flipping back to ready-for-review:
Will post back here when the rework lands. Happy to discuss the framing in advance if useful.
The first version of the schema doc had drifted toward Python-specific guidance (an explicit Python -> Allium type table; anti-patterns named after Python idioms like ternaries and `%`). That doesn't scale -- every new source language would need its own table -- and it's the wrong framing for a language-neutral spec language.
This rewrite describes what Allium accepts (positively):
- Translation principle stated upfront: the inventory contains only Allium constructs; if a source-language idiom has no direct Allium equivalent, model the behaviour in Allium, not the syntax.
- Allium type catalogue with literal-format rules (Duration `14.days`, Decimal/Integer/String/Boolean/Timestamp, `?` suffix nullability).
- Allium expression grammar listed positively: field access, `+ - * /`, comparison, `and/or/not/in`, named aggregations, `implies`. Nothing outside this list is an Allium expression.
- Conditional values: forbidden inline (in any source-language form); must be modelled as derived properties or split transitions.
- No-free-names rule with worked example: every identifier in an expression resolves through a declared param / let / field / `now` / `config.<name>`.
- Chained field access: explicit rule with worked example -- always write the full param chain (`assessment.claim.last_activity_at`, not bare `claim.last_activity_at`).
- Strict alphabetisation rules carried over from the original.
Validated on three fixtures from three different domains (and now two source languages):
fixture           source      LOC   errors  warnings  info
-----------------+-----------+-----+-------+---------+-----
insurance-claims  Python      855   0       2         43
trading-risk      Python      1367  0       7         64
build-pipeline    TypeScript  573   0       1         34
All three pass `allium check` with the consensus pipeline. The TypeScript fixture is the cross-language smoke test; it passed on the first attempt with the rewritten schema, confirming the schema is genuinely source-language-neutral.
The translator scripts (canonicalize-inventory.mjs, merge-inventories.mjs, inventory-to-spec.mjs) and the orchestrator SKILL.md are unchanged from the prior commit on this branch.
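The no-free-names and chained-field-access rules could be spot-checked mechanically. The sketch below is a rough illustration under stated assumptions: `freeNames` is a hypothetical helper, the regex is a simplification rather than Allium's grammar, `true`/`false` are assumed to be literals, and string literals inside expressions would produce false positives.

```javascript
// Operator words from the schema's expression list; `true`/`false` are assumed literals.
const KEYWORDS = new Set(["and", "or", "not", "in", "implies", "true", "false"]);

// Hypothetical checker: return the root identifiers of an expression that do not
// resolve to a declared name (param / let / field), `now`, or `config.<name>`.
function freeNames(expression, declared) {
  const allowed = new Set([...declared, "now", "config"]);
  // Match dotted identifier chains. The lookbehind skips anything that follows a
  // word character or a dot, so the `days` in `14.days` is not treated as a name.
  const chains = expression.match(/(?<![\w.])[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*/g) ?? [];
  const roots = chains.map((chain) => chain.split(".")[0]);
  return [...new Set(roots)]
    .filter((root) => !allowed.has(root) && !KEYWORDS.has(root))
    .sort();
}

// Only `assessment` is declared as a param in this made-up rule.
const declaredNames = ["assessment"];

// Full param chain: every root resolves.
console.log(freeNames("assessment.claim.last_activity_at + config.stale_after < now", declaredNames)); // []

// Bare `claim` is a free name because it is not reached through `assessment`.
console.log(freeNames("claim.last_activity_at + 14.days < now", declaredNames)); // [ 'claim' ]
```

Only the root of each chain is checked, which matches the rule as stated: a chain is valid when it starts from something declared.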
Rewrite landed in commit 6432de3. Schema doc (
Validated on three fixtures across two source languages:
The build-pipeline TypeScript fixture is the cross-language smoke test: it passed on the first attempt with the rewritten schema, which is the best evidence I have that the schema is genuinely language-neutral (and not just "didn't break Python").
Pipeline scripts and orchestrator SKILL.md are unchanged from the earlier commit on this branch. Only the schema doc moved. Flipping back to ready-for-review.
Diagnostic harness that drives K-sample distillation runs against a fixture codebase, canonicalises each inventory, merges into a consensus inventory and translates to a single deterministic spec. Captures per-sample metadata for variance analysis between plugin variants. The harness expects sibling plugins/ and fixtures/ trees that are not present on this branch; this commit lands the eval scripts only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Includes inventories, canonical/merged inventories, per-sample spec.allium, consensus specs, meta.json invocation records and generated comparison reports across all timestamped runs. Excludes the ~340 MB of node_modules trees that accumulated under propagate test sandboxes; those are regenerable via npm install inside each workdir.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replace the `distill` skill's LLM-writes-spec procedure with a four-stage consensus pipeline. The skill spawns K subagents in parallel via the Agent tool, each producing a structured JSON inventory of the codebase; the skill then canonicalises each, merges them into a consensus inventory, and translates that to the `.allium` spec via three pure-function Node scripts shipped with the plugin.
Result: same K canonical inventories → byte-identical `.allium` spec. The only remaining non-determinism is in the LLM-driven inventory step itself, which consensus voting largely washes out.
Architecture
The three scripts are pure functions of their inputs (Node, stdlib only; no extra deps). The orchestrator SKILL.md is 149 lines and is purely procedural; all schema/convention knowledge for subagents lives in `skills/distill/references/inventory-schema.md`.
Numbers
Measured on a freshly-authored ~500 LOC insurance-claims Python fixture exercising the nine patterns the skill needs to handle (status enums, guarded transitions, temporal rules, external webhook entity, third-party integration, implicit state machine, scattered logic, derived properties, FK→relationship).
allium check
*For a fixed set of K canonical inventories. Across LLM rolls the K inventories vary slightly; consensus voting washes most of it out and the resulting specs are structurally identical with small body-only diffs.
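For readers unfamiliar with the Jaccard figures quoted for entity and rule names, the metric is the size of the intersection of two name sets divided by the size of their union. The sketch below shows that calculation only; `jaccard` and the entity names are made up, and how the harness aggregates the score across run pairs is not stated in this PR.

```javascript
// Jaccard similarity of two name lists, treated as sets:
// |A ∩ B| / |A ∪ B|. Returns 1 for two empty sets by convention.
function jaccard(namesA, namesB) {
  const a = new Set(namesA);
  const b = new Set(namesB);
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const name of a) {
    if (b.has(name)) shared += 1;
  }
  // |A ∪ B| = |A| + |B| - |A ∩ B|
  return shared / (a.size + b.size - shared);
}

// Made-up entity names from two hypothetical runs.
const runA = ["Claim", "Policy", "Payout"];
const runB = ["Claim", "Policy", "Adjuster", "Payout"];

console.log(jaccard(runA, runB)); // 0.75
console.log(jaccard(runA, runA)); // 1
```

A score of 1.00 therefore means both runs named exactly the same set of constructs, regardless of the order in which they were listed.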
Files
- `skills/distill/SKILL.md`: rewritten as the orchestrator (149 lines, down from 816)
- `skills/distill/references/inventory-schema.md`: new (232 lines); the contract subagents follow
- `scripts/canonicalize-inventory.mjs`: new (230 lines)
- `scripts/merge-inventories.mjs`: new (201 lines)
- `scripts/inventory-to-spec.mjs`: new (553 lines)
No other files touched.
Costs and caveats
`PATH`. Scripts are stdlib-only.
`general-purpose` agents reading the inventory-schema doc directly, not invocations of the distill skill. The subagent prompt explicitly forbids invoking any skill.
Test plan
`./allium-distilled/spec.allium` is produced and `allium check` reports 0 errors.
`md5sum`).
Background
This came out of an A/B-harness experiment measuring distill determinism across nine iterations of SKILL.md tightenings on v3.3.0. The first eight iterations were LLM-writes-spec tweaks (best: 217-line pairwise diff median, all structural Jaccards at 1.00). The architectural pivot here — LLM produces the inventory, scripts produce the spec — was the only thing that broke through to byte-identical determinism without sacrificing feature coverage. Happy to share the harness if it's useful to the project.