distill: byte-deterministic consensus pipeline #35

Open. panayotovk wants to merge 4 commits into juxt:main from panayotovk:feat/deterministic-distill

@panayotovk

Summary

Replace the distill skill's LLM-writes-spec procedure with a four-stage consensus pipeline. The skill spawns K subagents in parallel via the Agent tool, each producing a structured JSON inventory of the codebase; the skill then canonicalises each, merges them into a consensus inventory, and translates that to the .allium spec via three pure-function Node scripts shipped with the plugin.

Result: same K canonical inventories → byte-identical .allium spec. The only remaining non-determinism is in the LLM-driven inventory step itself, which consensus voting largely washes out.
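
As a rough illustration of how consensus voting can absorb per-sample noise, here is a minimal sketch of majority and modal selection. It is not the contents of scripts/merge-inventories.mjs; the function names `majorityNames` and `modalValue` are hypothetical, and the real inventory shape is defined by references/inventory-schema.md.

```javascript
// Hypothetical sketch: not the PR's merge-inventories.mjs.

// Keep names that appear in a strict majority of the K inventories.
function majorityNames(nameLists) {
  const k = nameLists.length;
  const counts = new Map();
  for (const names of nameLists) {
    // Deduplicate within one inventory so each sample votes once per name.
    for (const name of new Set(names)) {
      counts.set(name, (counts.get(name) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, n]) => n * 2 > k)
    .map(([name]) => name)
    .sort(); // fixed output order, independent of input order
}

// Pick the most frequent value from a non-empty list of strings.
// Ties are broken lexicographically so the result does not depend on
// the order in which the samples arrived.
function modalValue(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) || 0) + 1);
  return [...counts.entries()].sort(
    ([a, na], [b, nb]) => nb - na || (a < b ? -1 : a > b ? 1 : 0)
  )[0][0];
}

// With K=3, a name seen by only one subagent is dropped.
console.log(
  majorityNames([
    ["Claim", "Policy"],
    ["Claim", "Adjuster"],
    ["Policy", "Claim"],
  ])
); // [ 'Claim', 'Policy' ]
```

The deterministic tie-break is the important design choice here: without it, two runs over the same K inventories in a different order could pick different winners.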

Architecture

K subagents (Agent tool)
   |  inventory-i.json (each subagent reads references/inventory-schema.md)
   v
scripts/canonicalize-inventory.mjs   # convention normalisation
   v
scripts/merge-inventories.mjs        # K-way majority/modal consensus
   v
scripts/inventory-to-spec.mjs        # pure-function translator
   v
allium check                         # validation

The three scripts are pure functions of their inputs (Node, stdlib only — no extra deps). The orchestrator SKILL.md is 149 lines and is purely procedural; all schema/convention knowledge for subagents lives in skills/distill/references/inventory-schema.md.
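
One ingredient that byte-level determinism usually needs is a serialiser whose output does not depend on object key insertion order. The sketch below shows one common way to get that in plain Node; it is an assumption about the general technique, not a copy of what the PR's scripts do, and `sortKeys` and `stableStringify` are hypothetical names.

```javascript
// Hypothetical sketch of order-independent JSON serialisation.

// Recursively rebuild objects with their keys in sorted order.
// Arrays keep their order: a canonicaliser would have to sort them
// separately where element order carries no meaning.
function sortKeys(value) {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const key of Object.keys(value).sort()) {
      out[key] = sortKeys(value[key]);
    }
    return out;
  }
  return value;
}

// Same logical value in, same bytes out.
function stableStringify(value) {
  return JSON.stringify(sortKeys(value), null, 2) + "\n";
}

const first = stableStringify({ b: 1, a: { d: [2, 1], c: null } });
const second = stableStringify({ a: { c: null, d: [2, 1] }, b: 1 });
console.log(first === second); // true
```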

Numbers

Measured on a freshly-authored ~500 LOC insurance-claims Python fixture exercising the nine patterns the skill needs to handle (status enums, guarded transitions, temporal rules, external webhook entity, third-party integration, implicit state machine, scattered logic, derived properties, FK→relationship).

Metric                      v3.3.0 baseline (LLM writes spec)   This PR (consensus pipeline)
---------------------------+-----------------------------------+----------------------------
allium check                6/6 pass                            6/6 pass
Pairwise spec-diff median   487 lines                           byte-identical*
Entity-name Jaccard         0.75                                1.00
Rule-name Jaccard           0.28                                1.00
Fixture pattern coverage    ~6/9                                9/9

*For a fixed set of K canonical inventories. Across LLM rolls the K inventories vary slightly; consensus voting washes most of it out and the resulting specs are structurally identical with small body-only diffs.
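
For readers unfamiliar with the Jaccard rows, the index is the size of the intersection of two name sets divided by the size of their union. A minimal sketch follows; how the harness aggregates the pairwise scores across runs is not stated here, so `pairwiseJaccard` simply returns every pairwise score and both function names are hypothetical.

```javascript
// Jaccard index of two collections of names: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  // Convention assumed here: two empty sets count as identical.
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  for (const x of setA) if (setB.has(x)) shared += 1;
  return shared / (setA.size + setB.size - shared);
}

// Score every unordered pair of runs.
function pairwiseJaccard(nameSets) {
  const scores = [];
  for (let i = 0; i < nameSets.length; i += 1) {
    for (let j = i + 1; j < nameSets.length; j += 1) {
      scores.push(jaccard(nameSets[i], nameSets[j]));
    }
  }
  return scores;
}

// Two runs sharing 2 of 4 distinct entity names.
console.log(
  jaccard(["Claim", "Policy", "Adjuster"], ["Claim", "Policy", "Payment"])
); // 0.5
```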

Files

  • skills/distill/SKILL.md — rewritten as the orchestrator (149 lines, down from 816)
  • skills/distill/references/inventory-schema.md — new (232 lines); the contract subagents follow
  • scripts/canonicalize-inventory.mjs — new (230 lines)
  • scripts/merge-inventories.mjs — new (201 lines)
  • scripts/inventory-to-spec.mjs — new (553 lines)

No other files touched.

Costs and caveats

  • K× inference per invocation, default K=3 (configurable in the orchestrator). Wall time dominated by the slowest of K parallel subagents (~5–10 min on Opus for the fixture).
  • Requires Node on PATH. Scripts are stdlib-only.
  • Subagent recursion prevention: subagents are general-purpose agents reading the inventory-schema doc directly, not invocations of the distill skill. The subagent prompt explicitly forbids invoking any skill.
  • Coverage gain (~6/9 → 9/9) is partly the explicit schema forcing the LLM to consider sections it might otherwise skip — value types, auxiliary enums, contracts as separate from external entities, top-level invariants.

Test plan

  • Distill a small Python service using the plugin; confirm ./allium-distilled/spec.allium is produced and allium check reports 0 errors.
  • Run distill twice on the same codebase; confirm specs are structurally identical (entity / rule / contract / surface name sets match exactly).
  • Run the three scripts standalone over the same inputs twice; confirm byte-identical outputs (md5sum).
  • Optional: run on a larger codebase (>2000 LOC) and confirm K=3 subagents complete within the per-invocation timeout.
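
The structural-identity check in the second item could be scripted along these lines. This is only a sketch: it assumes the name lists per category have already been extracted (for example from the merged inventory), and the category keys and the `diffNameSets` name are hypothetical.

```javascript
// Hypothetical category keys; the real inventory schema may name them differently.
const CATEGORIES = ["entities", "rules", "contracts", "surfaces"];

// Compare the name sets of two runs, category by category.
// Returns an empty object when every category matches exactly.
function diffNameSets(runA, runB) {
  const report = {};
  for (const cat of CATEGORIES) {
    const a = new Set(runA[cat] || []);
    const b = new Set(runB[cat] || []);
    const onlyInA = [...a].filter((n) => !b.has(n)).sort();
    const onlyInB = [...b].filter((n) => !a.has(n)).sort();
    if (onlyInA.length || onlyInB.length) {
      report[cat] = { onlyInA, onlyInB };
    }
  }
  return report;
}

const run1 = { entities: ["Claim", "Policy"], rules: ["ApproveClaim"] };
const run2 = { entities: ["Policy", "Claim"], rules: ["ApproveClaim", "ExpireClaim"] };
console.log(diffNameSets(run1, run2));
// { rules: { onlyInA: [], onlyInB: [ 'ExpireClaim' ] } }
```

For the byte-identical check in the third item, comparing md5sum output of the two files, as the test plan says, is sufficient; no name extraction is needed.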

Background

This came out of an A/B-harness experiment measuring distill determinism across nine iterations of SKILL.md tightenings on v3.3.0. The first eight iterations were LLM-writes-spec tweaks (best: 217-line pairwise diff median, all structural Jaccards at 1.00). The architectural pivot here — LLM produces the inventory, scripts produce the spec — was the only thing that broke through to byte-identical determinism without sacrificing feature coverage. Happy to share the harness if it's useful to the project.

Replace the LLM-writes-spec procedure with a four-stage pipeline that
produces a byte-deterministic spec. The skill spawns K subagents in
parallel via the Agent tool, each producing a structured JSON inventory
of the codebase; the skill then canonicalises each, merges them into a
consensus, and translates that to the .allium spec via three
pure-function Node scripts shipped with the plugin.

Architecture:

  K subagents (Agent tool)
     |  inventory-i.json (per subagent, follows references/inventory-schema.md)
     v
  scripts/canonicalize-inventory.mjs   # convention normalisation
     v
  scripts/merge-inventories.mjs        # K-way majority/modal consensus
     v
  scripts/inventory-to-spec.mjs        # pure-function translator
     v
  allium check                         # validation

The three scripts are pure functions: same K canonical inventories always
produce a byte-identical merged inventory, which always produces a
byte-identical .allium spec. The only non-determinism is the LLM stage
producing the K raw inventories; consensus voting washes most of it out.

Files:
- skills/distill/SKILL.md            rewritten as orchestrator (149 lines)
- skills/distill/references/inventory-schema.md   new (232 lines)
- scripts/canonicalize-inventory.mjs new (230 lines)
- scripts/merge-inventories.mjs      new (201 lines)
- scripts/inventory-to-spec.mjs      new (553 lines)

Measured on a fresh ~500 LOC insurance-claims Python fixture, comparing
v3.3.0 (LLM writes the spec) against this pipeline:

  metric                       v3.3.0 baseline   this PR
  ----------------------------+----------------+-------------------
  allium check                 6/6 pass          6/6 pass
  pairwise spec diff median    487 lines         byte-identical*
  entity-name Jaccard          0.75              1.00
  rule-name Jaccard            0.28              1.00
  9-pattern fixture coverage   ~6/9              9/9

  *for a fixed set of K canonical inventories. Across LLM rolls the K
  inventories vary slightly; consensus voting washes most of it out and
  the resulting specs are structurally identical with small body-only
  diffs.

Cost: K x inference per invocation. Default K=3, configurable in the
orchestrator. Wall time dominated by the slowest of K parallel
subagents (~5-10 min on Opus for the fixture).

Caveats:
- Requires Node available on PATH (stdlib only, no extra deps).
- Subagent prompt explicitly forbids invoking the distill skill itself
  (would recurse); uses general-purpose subagents reading the schema.
- The 9/9 vs ~6/9 coverage gain is partly because the explicit schema
  forces the LLM to consider sections it might otherwise skip
  (value types, auxiliary enums, contracts as separate constructs from
  external entities, top-level invariants).

@panayotovk marked this pull request as draft May 17, 2026 14:04

@panayotovk
Author

Flipping to draft while I rework the schema doc.

Quick note on what surfaced during follow-up testing: this PR was iterated on a single Python fixture (insurance-claims; 6/6 pass, 9/9 patterns, 0 errors). I subsequently authored a second Python fixture (a sell-side trading-risk desk, ~1,370 LOC) and ran the pipeline on it. The architecture (subagents → canonicalise → merge → translate) generalised — structural fidelity was good (7+2 entities, 3 contracts, 2 surfaces, 18 rules — all expected blocks present), and per-run reproducibility held. But the distilled spec failed allium check because the inventory schema doc this PR ships has crept toward Python-specific guidance (a Python → Allium type table, anti-patterns named after Python idioms like ternaries and %).

That's the wrong direction for a language-neutral spec language: it doesn't scale (every new source language would need its own table), and it puts the wrong burden on the schema doc.

I'm reworking the schema in a positive, language-neutral framing: describing what Allium accepts (expression grammar, type catalogue, construct catalogue) rather than what specific source languages don't have. A translation principle replaces the per-language anti-patterns: "the inventory contains only Allium constructs; if the source code uses an idiom with no direct Allium equivalent, model the behaviour in Allium, not the syntax". For example, a conditional value becomes a case expression or a derived property regardless of whether the source used ? :, if/else, pattern matching, or anything else.

Plan before flipping back to ready-for-review:

  1. Rewrite references/inventory-schema.md per the above.
  2. Re-run both Python fixtures; verify both pass allium check.
  3. Add a small fixture in a non-Python source language (probably TypeScript) as a generalisation smoke test.
  4. Update this PR body to reflect the rework and the broader fixture coverage.

Will post back here when the rework lands. Happy to discuss the framing in advance if useful.

The first version of the schema doc had drifted toward Python-specific
guidance (an explicit Python -> Allium type table; anti-patterns named
after Python idioms like ternaries and `%`). That doesn't scale -- every
new source language would need its own table -- and it's the wrong
framing for a language-neutral spec language.

This rewrite describes what Allium accepts (positively):

- Translation principle stated upfront: the inventory contains only
  Allium constructs; if a source-language idiom has no direct Allium
  equivalent, model the behaviour in Allium, not the syntax.
- Allium type catalogue with literal-format rules (Duration `14.days`,
  Decimal/Integer/String/Boolean/Timestamp, `?` suffix nullability).
- Allium expression grammar listed positively: field access, `+ - * /`,
  comparison, `and/or/not/in`, named aggregations, `implies`. Nothing
  outside this list is an Allium expression.
- Conditional values: forbidden inline (in any source-language form);
  must be modelled as derived properties or split transitions.
- No-free-names rule with worked example: every identifier in an
  expression resolves through a declared param / let / field / `now` /
  `config.<name>`.
- Chained field access: explicit rule with worked example -- always
  write the full param chain (`assessment.claim.last_activity_at`,
  not bare `claim.last_activity_at`).
- Strict alphabetisation rules carried over from the original.
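
To make the no-free-names and chained-field-access rules above concrete, here is a rough checker sketch. It is not part of this commit and it does not parse real Allium: it assumes expressions are plain strings, treats `and/or/not/in/implies` (plus `true`/`false`, an assumption) as keywords, and does not handle named aggregations. The name `freeNames` is hypothetical.

```javascript
// Words treated as keywords rather than identifiers (partly assumed).
const KEYWORDS = new Set(["and", "or", "not", "in", "implies", "true", "false"]);

// Return the dotted chains whose root is not a declared param / let /
// field, nor `now`, nor `config`.
function freeNames(expression, declared) {
  const scope = new Set([...declared, "now", "config"]);
  const stripped = expression
    .replace(/"[^"]*"/g, " ") // drop string literals
    // drop numeric literals and durations such as 14.days
    .replace(/\b\d+(?:\.\d+)?(?:\.[A-Za-z_]\w*)?/g, " ");
  const chains = stripped.match(/[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*/g) || [];
  const free = new Set();
  for (const chain of chains) {
    const root = chain.split(".")[0];
    if (!KEYWORDS.has(root) && !scope.has(root)) free.add(chain);
  }
  return [...free].sort();
}

// Full param chain: every name resolves through the declared param.
console.log(
  freeNames("assessment.claim.last_activity_at + 14.days < now", ["assessment"])
); // []

// Bare chain: `claim` is not declared in this scope, so it is reported.
console.log(
  freeNames("claim.last_activity_at + 14.days < now", ["assessment"])
); // [ 'claim.last_activity_at' ]
```

The example expressions are illustrative only; they combine operators and literals listed above but have not been run through `allium check`.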

Validated on three fixtures from three different domains (and now two
source languages):

  fixture            source        LOC   errors   warnings   info
  -----------------+------------+------+--------+----------+------
  insurance-claims   Python        855   0        2          43
  trading-risk       Python       1367   0        7          64
  build-pipeline     TypeScript    573   0        1          34

All three pass `allium check` with the consensus pipeline. The
TypeScript fixture is the cross-language smoke test; it passed on the
first attempt with the rewritten schema, confirming the schema is
genuinely source-language-neutral.

The translator scripts (canonicalize-inventory.mjs, merge-inventories.mjs,
inventory-to-spec.mjs) and the orchestrator SKILL.md are unchanged from
the prior commit on this branch.
@panayotovk
Author

Rewrite landed in commit 6432de3.

Schema doc (skills/distill/references/inventory-schema.md) reworked end-to-end in language-neutral framing: describes what Allium accepts (type catalogue, expression grammar, construct catalogue) rather than what specific source languages don't have. The Python→Allium type table is gone. Per-language anti-patterns are gone. A single translation principle replaces them: "the inventory contains only Allium constructs; if a source-language idiom has no direct Allium equivalent, model the behaviour in Allium, not the syntax."

Validated on three fixtures across two source languages:

Fixture            Source       LOC    allium check
-----------------+------------+------+------------------
insurance-claims   Python       855    0E / 2W / 43I ✓
trading-risk       Python       1367   0E / 7W / 64I ✓
build-pipeline     TypeScript   573    0E / 1W / 34I ✓

The build-pipeline TypeScript fixture is the cross-language smoke test — it passed on the first attempt with the rewritten schema, which is the best evidence I have that the schema is genuinely language-neutral (and not just "didn't break Python").

Pipeline scripts and orchestrator SKILL.md are unchanged from the earlier commit on this branch. Only the schema doc moved.

Flipping back to ready-for-review.

Yavor Panayotov and others added 2 commits May 18, 2026 14:54
Diagnostic harness that drives K-sample distillation runs against a
fixture codebase, canonicalises each inventory, merges into a consensus
inventory and translates to a single deterministic spec. Captures
per-sample metadata for variance analysis between plugin variants.

The harness expects sibling plugins/ and fixtures/ trees that are not
present on this branch; this commit lands the eval scripts only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Includes inventories, canonical/merged inventories, per-sample spec.allium,
consensus specs, meta.json invocation records and generated comparison
reports across all timestamped runs. Excludes the ~340 MB of node_modules
trees that accumulated under propagate test sandboxes — those are
regenerable via npm install inside each workdir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>