distill: byte-deterministic consensus pipeline by panayotovk · Pull Request #35 · juxt/allium

panayotovk · 2026-05-17T08:43:17Z

Summary

Replace the distill skill's LLM-writes-spec procedure with a four-stage consensus pipeline. The skill spawns K subagents in parallel via the Agent tool, each producing a structured JSON inventory of the codebase; the skill then canonicalises each, merges them into a consensus inventory, and translates that to the .allium spec via three pure-function Node scripts shipped with the plugin.

Result: same K canonical inventories → byte-identical .allium spec. The only remaining non-determinism is in the LLM-driven inventory step itself, which consensus voting largely washes out.
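To make "consensus voting" concrete, here is a minimal sketch of K-way majority and modal voting. It is not the PR's merge-inventories.mjs: the inventory shape (an `entities` array of names) and the function names `mergeEntityNames` and `modalValue` are assumptions made for illustration.

```javascript
// Hypothetical sketch; the real merge-inventories.mjs and the inventory shape are not shown in this PR text.

// Keep an entity name only if a strict majority of the K inventories contain it.
function mergeEntityNames(inventories) {
  const k = inventories.length;
  const votes = new Map();
  for (const inv of inventories) {
    // De-duplicate within one inventory so a single subagent cannot vote twice.
    for (const name of new Set(inv.entities)) {
      votes.set(name, (votes.get(name) ?? 0) + 1);
    }
  }
  return [...votes.entries()]
    .filter(([, count]) => count * 2 > k)
    .map(([name]) => name)
    .sort(); // sorted so the result does not depend on sample order
}

// Pick the most frequent value; ties are broken by string order, not by arrival order.
// Assumes a non-empty array of strings.
function modalValue(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return [...counts.entries()].sort(
    ([a, ca], [b, cb]) => cb - ca || (a < b ? -1 : a > b ? 1 : 0)
  )[0][0];
}

// Illustrative names only, not taken from the fixture.
const samples = [
  { entities: ["Claim", "Policy", "Assessment"] },
  { entities: ["Claim", "Policy"] },
  { entities: ["Claim", "Policy", "Payout"] },
];
console.log(mergeEntityNames(samples)); // [ 'Claim', 'Policy' ]
console.log(modalValue(["Decimal", "Integer", "Decimal"])); // Decimal
```

With K=3, a name that only one subagent reports is dropped, which is how a single divergent roll gets washed out. The deterministic tie-break matters as much as the vote: without it, the merged output could depend on the order in which subagents finished.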

Architecture

K subagents (Agent tool)
   |  inventory-i.json (each subagent reads references/inventory-schema.md)
   v
scripts/canonicalize-inventory.mjs   # convention normalisation
   v
scripts/merge-inventories.mjs        # K-way majority/modal consensus
   v
scripts/inventory-to-spec.mjs        # pure-function translator
   v
allium check                         # validation

The three scripts are pure functions of their inputs (Node, stdlib only — no extra deps). The orchestrator SKILL.md is 149 lines and is purely procedural; all schema/convention knowledge for subagents lives in skills/distill/references/inventory-schema.md.
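One ingredient of byte-determinism is serialising JSON so that key order never leaks into the output. The sketch below shows that idea only; it is an assumption about the general technique, not the contents of canonicalize-inventory.mjs, and the names `canonicalize` and `serialize` are hypothetical.

```javascript
// Hypothetical sketch, not the PR's canonicalize-inventory.mjs.
// Recursively rebuild objects with sorted keys so JSON.stringify output
// does not depend on the order in which a subagent emitted the keys.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize); // array order is preserved
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const key of Object.keys(value).sort()) {
      out[key] = canonicalize(value[key]);
    }
    return out;
  }
  return value; // strings, numbers, booleans, null
}

// Fixed indentation and a trailing newline keep the bytes stable across runs.
function serialize(inventory) {
  return JSON.stringify(canonicalize(inventory), null, 2) + "\n";
}

// Same content, different key order (illustrative shape only).
const first = { entities: [{ name: "Claim", fields: ["status"] }], rules: [] };
const second = { rules: [], entities: [{ fields: ["status"], name: "Claim" }] };
console.log(serialize(first) === serialize(second)); // true
```

Array order is left alone here; a real canonicaliser would also need a rule for ordering list items, such as the alphabetisation rules the schema doc mentions.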

Numbers

Measured on a freshly-authored ~500 LOC insurance-claims Python fixture exercising the nine patterns the skill needs to handle (status enums, guarded transitions, temporal rules, external webhook entity, third-party integration, implicit state machine, scattered logic, derived properties, FK→relationship).

Metric	v3.3.0 baseline (LLM writes spec)	This PR (consensus pipeline)
`allium check`	6/6 pass	6/6 pass
Pairwise spec-diff median	487 lines	byte-identical*
Entity-name Jaccard	0.75	1.00
Rule-name Jaccard	0.28	1.00
Fixture pattern coverage	~6/9	9/9

*For a fixed set of K canonical inventories. Across LLM rolls the K inventories vary slightly; consensus voting washes most of it out and the resulting specs are structurally identical with small body-only diffs.
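For readers unfamiliar with the Jaccard rows in the table, the metric is the size of the intersection of two name sets divided by the size of their union. The sketch below computes it for a pair of runs and averages it over all pairs; the averaging step is an assumption, since the PR does not say how the harness aggregates pairs, and the function names are hypothetical.

```javascript
// Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of names.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1; // convention: two empty sets match
  let shared = 0;
  for (const name of setA) if (setB.has(name)) shared += 1;
  return shared / (setA.size + setB.size - shared);
}

// Mean over all unordered pairs of runs (assumed aggregation, not confirmed by the PR).
function meanPairwiseJaccard(runs) {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < runs.length; i++) {
    for (let j = i + 1; j < runs.length; j++) {
      total += jaccard(runs[i], runs[j]);
      pairs += 1;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}

// Illustrative names only: 2 shared names out of 4 distinct ones.
console.log(jaccard(["Claim", "Policy", "Payout"], ["Claim", "Policy", "Assessment"])); // 0.5
```

A score of 1.00 means every pair of runs produced exactly the same set of names, which is what "structurally identical" amounts to for entities and rules.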

Files

skills/distill/SKILL.md — rewritten as the orchestrator (149 lines, down from 816)
skills/distill/references/inventory-schema.md — new (232 lines); the contract subagents follow
scripts/canonicalize-inventory.mjs — new (230 lines)
scripts/merge-inventories.mjs — new (201 lines)
scripts/inventory-to-spec.mjs — new (553 lines)

No other files touched.

Costs and caveats

K× inference per invocation, default K=3 (configurable in the orchestrator). Wall time dominated by the slowest of K parallel subagents (~5–10 min on Opus for the fixture).
Requires Node on PATH. Scripts are stdlib-only.
Subagent recursion prevention: subagents are general-purpose agents reading the inventory-schema doc directly, not invocations of the distill skill. The subagent prompt explicitly forbids invoking any skill.
Coverage gain (~6/9 → 9/9) is partly the explicit schema forcing the LLM to consider sections it might otherwise skip — value types, auxiliary enums, contracts as separate from external entities, top-level invariants.

Test plan

Distill a small Python service using the plugin; confirm ./allium-distilled/spec.allium is produced and allium check reports 0 errors.
Run distill twice on the same codebase; confirm specs are structurally identical (entity / rule / contract / surface name sets match exactly).
Run the three scripts standalone over the same inputs twice; confirm byte-identical outputs (md5sum).
Optional: run on a larger codebase (>2000 LOC) and confirm K=3 subagents complete within the per-invocation timeout.

Background

This came out of an A/B-harness experiment measuring distill determinism across nine iterations of SKILL.md tightenings on v3.3.0. The first eight iterations were LLM-writes-spec tweaks (best: 217-line pairwise diff median, all structural Jaccards at 1.00). The architectural pivot here — LLM produces the inventory, scripts produce the spec — was the only thing that broke through to byte-identical determinism without sacrificing feature coverage. Happy to share the harness if it's useful to the project.

Replace the LLM-writes-spec procedure with a four-stage pipeline that produces a byte-deterministic spec. The skill spawns K subagents in parallel via the Agent tool, each producing a structured JSON inventory of the codebase; the skill then canonicalises each, merges them into a consensus, and translates that to the .allium spec via three pure-function Node scripts shipped with the plugin.

Architecture:

K subagents (Agent tool)
   |  inventory-i.json (per subagent, follows references/inventory-schema.md)
   v
scripts/canonicalize-inventory.mjs   # convention normalisation
   v
scripts/merge-inventories.mjs        # K-way majority/modal consensus
   v
scripts/inventory-to-spec.mjs        # pure-function translator
   v
allium check                         # validation

The three scripts are pure functions: same K canonical inventories always produce a byte-identical merged inventory, which always produces a byte-identical .allium spec. The only non-determinism is the LLM stage producing the K raw inventories; consensus voting washes most of it out.

Files:

- skills/distill/SKILL.md rewritten as orchestrator (149 lines)
- skills/distill/references/inventory-schema.md new (232 lines)
- scripts/canonicalize-inventory.mjs new (230 lines)
- scripts/merge-inventories.mjs new (201 lines)
- scripts/inventory-to-spec.mjs new (553 lines)

Measured on a fresh ~500 LOC insurance-claims Python fixture, comparing v3.3.0 (LLM writes the spec) against this pipeline:

metric                     | v3.3.0 baseline | this PR
---------------------------+-----------------+----------------
allium check               | 6/6 pass        | 6/6 pass
pairwise spec diff median  | 487 lines       | byte-identical*
entity-name Jaccard        | 0.75            | 1.00
rule-name Jaccard          | 0.28            | 1.00
9-pattern fixture coverage | ~6/9            | 9/9

*for a fixed set of K canonical inventories. Across LLM rolls the K inventories vary slightly; consensus voting washes most of it out and the resulting specs are structurally identical with small body-only diffs.

Cost: K x inference per invocation. Default K=3, configurable in the orchestrator. Wall time dominated by the slowest of K parallel subagents (~5-10 min on Opus for the fixture).

Caveats:

- Requires Node available on PATH (stdlib only, no extra deps).
- Subagent prompt explicitly forbids invoking the distill skill itself (would recurse); uses general-purpose subagents reading the schema.
- The 9/9 vs ~6/9 coverage gain is partly because the explicit schema forces the LLM to consider sections it might otherwise skip (value types, auxiliary enums, contracts as separate constructs from external entities, top-level invariants).

panayotovk · 2026-05-17T14:04:45Z

Flipping to draft while I rework the schema doc.

Quick note on what surfaced during follow-up testing: this PR was iterated on a single Python fixture (insurance-claims; 6/6 pass, 9/9 patterns, 0 errors). I subsequently authored a second Python fixture (a sell-side trading-risk desk, ~1,370 LOC) and ran the pipeline on it. The architecture (subagents → canonicalise → merge → translate) generalised — structural fidelity was good (7+2 entities, 3 contracts, 2 surfaces, 18 rules — all expected blocks present), and per-run reproducibility held. But the distilled spec failed allium check because the inventory schema doc this PR ships has crept toward Python-specific guidance (a Python → Allium type table, anti-patterns named after Python idioms like ternaries and %).

That's the wrong direction for a language-neutral spec language: it doesn't scale (every new source language would need its own table), and it puts the wrong burden on the schema doc.

I'm reworking the schema in a positive, language-neutral framing: it describes what Allium accepts (expression grammar, type catalogue, construct catalogue) rather than what specific source languages lack. A translation principle replaces the per-language anti-patterns: "the inventory contains only Allium constructs; if the source code uses an idiom with no direct Allium equivalent, model the behaviour in Allium, not the syntax". For example, a conditional value becomes a case expression or a derived property regardless of whether the source used a ternary (? :), if/else, a pattern match, or anything else.

Plan before flipping back to ready-for-review:

Rewrite references/inventory-schema.md per the above.
Re-run both Python fixtures; verify both pass allium check.
Add a small fixture in a non-Python source language (probably TypeScript) as a generalisation smoke test.
Update this PR body to reflect the rework and the broader fixture coverage.

Will post back here when the rework lands. Happy to discuss the framing in advance if useful.

The first version of the schema doc had drifted toward Python-specific guidance (an explicit Python -> Allium type table; anti-patterns named after Python idioms like ternaries and `%`). That doesn't scale -- every new source language would need its own table -- and it's the wrong framing for a language-neutral spec language.

This rewrite describes what Allium accepts (positively):

- Translation principle stated upfront: the inventory contains only Allium constructs; if a source-language idiom has no direct Allium equivalent, model the behaviour in Allium, not the syntax.
- Allium type catalogue with literal-format rules (Duration `14.days`, Decimal/Integer/String/Boolean/Timestamp, `?` suffix nullability).
- Allium expression grammar listed positively: field access, `+ - * /`, comparison, `and/or/not/in`, named aggregations, `implies`. Nothing outside this list is an Allium expression.
- Conditional values: forbidden inline (in any source-language form); must be modelled as derived properties or split transitions.
- No-free-names rule with worked example: every identifier in an expression resolves through a declared param / let / field / `now` / `config.<name>`.
- Chained field access: explicit rule with worked example -- always write the full param chain (`assessment.claim.last_activity_at`, not bare `claim.last_activity_at`).
- Strict alphabetisation rules carried over from the original.

Validated on three fixtures from three different domains (and now two source languages):

fixture          | source     | LOC  | errors | warnings | info
-----------------+------------+------+--------+----------+-----
insurance-claims | Python     | 855  | 0      | 2        | 43
trading-risk     | Python     | 1367 | 0      | 7        | 64
build-pipeline   | TypeScript | 573  | 0      | 1        | 34

All three pass `allium check` with the consensus pipeline. The TypeScript fixture is the cross-language smoke test; it passed on the first attempt with the rewritten schema, confirming the schema is genuinely source-language-neutral.

The translator scripts (canonicalize-inventory.mjs, merge-inventories.mjs, inventory-to-spec.mjs) and the orchestrator SKILL.md are unchanged from the prior commit on this branch.
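The no-free-names and full-chain rules can be read as one check: the first segment of a dotted access must be something the rule has declared. The sketch below illustrates that reading only. In the PR the rule is enforced through the schema doc and `allium check`; the function `rootResolves`, the treatment of `now` as a bare name, and the one-segment form of `config.<name>` are all assumptions.

```javascript
// Hypothetical checker, not part of the PR's scripts.
// `declared` holds the params, lets and fields visible to the expression.
function rootResolves(chain, declared) {
  const [root, ...rest] = chain.split(".");
  if (root === "now") return rest.length === 0;    // assumed: `now` is used bare
  if (root === "config") return rest.length === 1; // assumed: exactly config.<name>
  return declared.has(root);
}

// Suppose the rule declares a single param named `assessment`.
const scope = new Set(["assessment"]);
console.log(rootResolves("assessment.claim.last_activity_at", scope)); // true
console.log(rootResolves("claim.last_activity_at", scope));            // false
```

The second call fails because `claim` was never declared in this scope; it is only reachable through `assessment`, which is the case the commit message's worked example describes.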

panayotovk · 2026-05-17T15:59:41Z

Rewrite landed in commit 6432de3.

Schema doc (skills/distill/references/inventory-schema.md) reworked end-to-end in language-neutral framing: describes what Allium accepts (type catalogue, expression grammar, construct catalogue) rather than what specific source languages don't have. The Python→Allium type table is gone. Per-language anti-patterns are gone. A single translation principle replaces them: "the inventory contains only Allium constructs; if a source-language idiom has no direct Allium equivalent, model the behaviour in Allium, not the syntax."

Validated on three fixtures across two source languages:

Fixture	Source	LOC	`allium check`
insurance-claims	Python	855	0E / 2W / 43I ✓
trading-risk	Python	1367	0E / 7W / 64I ✓
build-pipeline	TypeScript	573	0E / 1W / 34I ✓

The build-pipeline TypeScript fixture is the cross-language smoke test — it passed on the first attempt with the rewritten schema, which is the best evidence I have that the schema is genuinely language-neutral (and not just "didn't break Python").

Pipeline scripts and orchestrator SKILL.md are unchanged from the earlier commit on this branch. Only the schema doc moved.

Flipping back to ready-for-review.

Diagnostic harness that drives K-sample distillation runs against a fixture codebase, canonicalises each inventory, merges into a consensus inventory and translates to a single deterministic spec. Captures per-sample metadata for variance analysis between plugin variants. The harness expects sibling plugins/ and fixtures/ trees that are not present on this branch; this commit lands the eval scripts only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Includes inventories, canonical/merged inventories, per-sample spec.allium, consensus specs, meta.json invocation records and generated comparison reports across all timestamped runs. Excludes the ~340 MB of node_modules trees that accumulated under propagate test sandboxes; those are regenerable via npm install inside each workdir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

panayotovk marked this pull request as draft May 17, 2026 14:04

panayotovk marked this pull request as ready for review May 17, 2026 15:59

panayotovk mentioned this pull request May 17, 2026

Deterministic, language-agnostic propagate via inventory + translator + consensus #36

Open

Yavor Panayotov and others added 2 commits May 18, 2026 14:54

panayotovk wants to merge 4 commits into juxt:main from panayotovk:feat/deterministic-distill
