feat: harden the offline LLM proposer subsystem (#363, #364, #365, #366, #367, #368, #374) by dgenio · Pull Request #462 · dgenio/ChainWeaver

dgenio · 2026-06-17T20:46:08Z

Summary

Makes the offline LLM proposers (llm_propose_flows, optimize_tool_descriptions) production-grade as one coherent change over a single subsystem (compiler_llm, optimizer, _offline_llm). Implements the seven issues in the recommended "harden the offline LLM proposer" group, built bottom-up in three commits so each layer is independently green. Everything is opt-in and back-compatible with the bare LLMFn seam, and the base package still imports no provider SDK.

Changes

#364 prompt versioning + provenance (issue: "Version proposer prompts and record provenance on LLM-generated proposals"): chainweaver/proposals.py adds ModelInfo / ProposalProvenance (prompt name/version/SHA-256, caller-asserted model, params, repair usage, catalogue stats), attached to every proposal; write_proposals persists it and read_provenance reads it back. Templates are pinned by PROMPT_VERSION plus a hash-guard test.
#363 structured output + bounded repair (issue: "Adopt structured-output contracts with a bounded repair loop for the offline LLM proposers"): the StructuredLLMFn protocol receives a published JSON-Schema envelope (schemas/proposal-{flows,descriptions}.schema.json); JSON-first parsing with YAML fallback; one-shot repair (max_repair_attempts, default 1) carrying the validation error.
#366 adversarial metadata hardening (issue: "Add adversarial test coverage for tool metadata rendered into proposer prompts"): render_tool_catalogue flattens control chars / Unicode separators and caps description length so hostile tool metadata can't break the one-entry-per-tool prompt structure (fixture + Hypothesis property coverage).
#367 token budgeting (issue: "Introduce token budgeting and catalogue selection for proposer prompts"): PromptBudget with error/truncate/batch/select overflow strategies, chars/4 estimator + pluggable token_counter, typed PromptBudgetExceededError (CW-E042) raised before any LLM call.
#368 provider adapters (issue: "Ship optional LLMFn provider adapters with retry, timeout, and token accounting"): chainweaver/integrations/llm_anthropic.py + llm_openai.py (and shared _llm_common.py) producing LLMFn/StructuredLLMFn with retry/backoff, timeout, max_calls/max_cost_usd ceilings (LLMProviderError CW-E043, LLMBudgetExceededError CW-E044), and a live .usage tally. New extras chainweaver[llm-anthropic] / [llm-openai] (the OpenAI adapter also covers OpenAI-compatible/local endpoints via base_url).
#374 routing-accuracy evaluation (issue: "Add eval-driven tool-description optimization with measured routing outcomes"): chainweaver/routing.py adds RoutingCase, evaluate_routing (overall + per-tool accuracy, confusion pairs), and mine_routing_cases; optimize_tool_descriptions(..., eval_cases=, routing_selector=) annotates each proposal with before/after per-tool selection accuracy.
#365 eval harnesses (issue: "Add a golden-dataset eval harness for the LLM-backed proposers"): evals/ tree (10 golden proposer cases + 20 routing cases, scorer, runner emitting results/latest.{json,md}) plus an opt-in .github/workflows/evals.yml (manual + weekly) that runs against a real provider via repo secrets; a deterministic stub exercises the harness in normal CI.
Governance: AGENTS.md repo map, CHANGELOG.md, docs/reference/error-table.md, and the public-API snapshot updated.
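For readers unfamiliar with the pattern, the #363 item (JSON-first parsing plus a one-shot repair that carries the validation error) can be pictured roughly as below. This is a minimal sketch, not ChainWeaver's implementation: `propose_with_repair`, `validate_envelope`, the envelope shape, and the repair-prompt wording are all hypothetical, and the YAML fallback is left out to keep the block stdlib-only. Only the `max_repair_attempts` name and its default of 1 come from the description above.

```python
import json
from typing import Callable

LLMFn = Callable[[str], str]  # assumed shape of the bare seam: prompt in, text out


def validate_envelope(data: object) -> list:
    """Hypothetical validator: expects {"proposals": [...]}."""
    if not isinstance(data, dict) or not isinstance(data.get("proposals"), list):
        raise ValueError("envelope must be an object with a 'proposals' list")
    return data["proposals"]


def propose_with_repair(llm: LLMFn, prompt: str, max_repair_attempts: int = 1):
    """Return (proposals, repairs_used); raise ValueError once repairs run out."""
    current_prompt = prompt
    last_error = None
    for attempt in range(max_repair_attempts + 1):
        raw = llm(current_prompt)
        try:
            # json.JSONDecodeError subclasses ValueError, so one except covers both.
            return validate_envelope(json.loads(raw)), attempt
        except ValueError as exc:
            last_error = exc
            # Carry the validation error into the repair prompt.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid: {exc}\n"
                "Reply with valid JSON only."
            )
    raise ValueError(f"no valid proposal after repair: {last_error}")
```

Keeping the loop bounded means a misbehaving model costs at most `max_repair_attempts + 1` calls per proposal request.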

Testing

Linting passes (ruff check chainweaver/ tests/ examples/) — All checks passed
Formatting check passes (ruff format --check chainweaver/ tests/ examples/) — 244 files already formatted
Type checking passes (python -m mypy chainweaver/ tests/) — no issues in 207 source files
All existing tests pass (python -m pytest tests/) — 1780 passed, 1 skipped, total coverage 93.08% (gate 80%); new modules: proposals.py 100%, routing.py 97%, optimizer.py 98%
New tests added for new functionality — test_proposals.py, test_proposer_adversarial.py, test_routing.py, test_llm_adapters.py, test_evals_harness.py

All LLM calls in tests use scripted/fake stubs — no provider SDK is imported and no network call is made.

Related Issues

Closes #363
Closes #364
Closes #365
Closes #366
Closes #367
Closes #368
Closes #374

Checklist

Code follows project conventions (see AGENTS.md and docs/agent-context/)
Public API changes are documented (new symbols in __all__ + snapshot regenerated; CHANGELOG + AGENTS.md repo map + error-table updated)
No secrets or credentials included

Tradeoffs / risks

Two parse paths (JSON + YAML) are maintained; shared envelope coercion limits duplication.
Provider SDK churn is a maintenance cost; adapters are kept thin (one call site each) and pinned to floor versions. Cost estimates depend on the maintained price table (see #351, "Implement provider price refresh and staleness checks for the cost table").
The batch overflow strategy can't propose a flow spanning tools in different batches — documented; prefer select for relevance-aware workloads.
Real-model eval quality/cost is non-deterministic, so that lane is manual/scheduled only and never a merge gate.

🤖 Generated with Claude Code

https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk

Generated by Claude Code

…t, adversarial hardening

Hardens the offline LLM proposer subsystem (compiler_llm, optimizer):

- #363 structured-output contracts + bounded repair: StructuredLLMFn seam, JSON-first parsing with YAML fallback, one-shot repair loop carrying the validation error; published envelope JSON schemas under schemas/.
- #364 prompt versioning + provenance: PROMPT_NAME/PROMPT_VERSION pinned by a template-hash guard test; ProposalProvenance (prompt, model, params, repair usage, catalogue stats) attached to every proposal and persisted by write_proposals (read_provenance round-trips it).
- #366 adversarial metadata hardening: render_tool_catalogue flattens control chars / Unicode separators and caps description length so hostile tool metadata cannot break the one-entry-per-tool structure (fixture + Hypothesis property coverage).
- #367 token budgeting: PromptBudget with error/truncate/batch/select overflow strategies, chars/4 estimator + pluggable token_counter, catalogue stats in provenance.

New error codes CW-E042 (PromptBudgetExceededError), CW-E043 (LLMProviderError), CW-E044 (LLMBudgetExceededError) registered and documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
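The #366 hardening (flatten control characters and Unicode separators, cap description length, keep one entry per tool) could look something like the sketch below. It is illustrative only: `flatten_text`, `render_catalogue`, the 200-character cap, and the exact set of Unicode categories are assumptions, not the behaviour of the real render_tool_catalogue.

```python
import unicodedata

MAX_DESCRIPTION_CHARS = 200  # hypothetical cap; the real limit is not stated in this PR


def flatten_text(text: str, limit: int = MAX_DESCRIPTION_CHARS) -> str:
    """Replace control/format/separator characters with spaces and cap the length."""
    cleaned = []
    for ch in text:
        # Cc = control, Cf = format, Zl / Zp = line / paragraph separators.
        if unicodedata.category(ch) in {"Cc", "Cf", "Zl", "Zp"}:
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    flat = " ".join("".join(cleaned).split())  # collapse whitespace runs
    return flat if len(flat) <= limit else flat[: limit - 1] + "…"


def render_catalogue(tools: dict) -> str:
    """Render exactly one line per tool, whatever the metadata contains."""
    return "\n".join(
        f"- {flatten_text(name)}: {flatten_text(desc)}" for name, desc in tools.items()
    )
```

With this shape, a description that embeds a newline followed by "- evil: ..." stays on its own tool's line instead of masquerading as a second catalogue entry.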

…#368)

Ship thin, optional-extra adapters that produce LLMFn/StructuredLLMFn callables for the offline proposers, keeping the base package free of provider SDKs:

- chainweaver.integrations._llm_common: ProviderAdapter base with tenacity retry/backoff+jitter, per-call timeout, max_calls/max_cost_usd ceilings (priced from the maintained cost table), and a live LLMUsage tally.
- llm_anthropic.anthropic_llm_fn / llm_openai.openai_llm_fn (the latter also covers OpenAI-compatible/local endpoints via base_url). Lazy SDK imports with the established 'install the extra' ImportError; structured-output via the json_schema seam.
- New extras chainweaver[llm-anthropic], chainweaver[llm-openai]; mypy override so the absent SDKs don't fail the type check.

Tests use fake clients (no SDK, no network); backoff sleep is patched out.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
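As a rough picture of how retry/backoff and a call ceiling interact, here is a stdlib-only sketch. The commit says the real adapter base uses tenacity; this example swaps in a hand-rolled loop so it runs without extra dependencies. `with_retry`, `Usage`, and `CallCeilingExceeded` are hypothetical names, and treating ConnectionError as the only retryable error is an assumption made for brevity.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable


class CallCeilingExceeded(RuntimeError):
    """Hypothetical stand-in for a budget error such as LLMBudgetExceededError."""


@dataclass
class Usage:
    calls: int = 0
    retries: int = 0


def with_retry(
    call: Callable[[str], str],
    *,
    max_calls: int,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Wrap `call` with exponential backoff + jitter and a hard call ceiling."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    usage = Usage()

    def wrapped(prompt: str) -> str:
        for attempt in range(max_attempts):
            # Check the ceiling before every attempt, retries included.
            if usage.calls >= max_calls:
                raise CallCeilingExceeded(f"max_calls={max_calls} reached")
            usage.calls += 1
            try:
                return call(prompt)
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                usage.retries += 1
                # Exponential backoff with full jitter.
                sleep(random.uniform(0, base_delay * 2**attempt))
        raise RuntimeError("retry loop exited without a result")

    return wrapped, usage
```

Injecting `sleep` mirrors the "backoff sleep is patched out" note: tests can pass a no-op and still exercise the retry path.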

…esses (#365, #374)

- #374 chainweaver/routing.py: RoutingCase, evaluate_routing (overall + per-tool accuracy, confusion pairs), and mine_routing_cases from agent traces. optimize_tool_descriptions(..., eval_cases=, routing_selector=) now annotates each proposal with before/after per-tool selection accuracy.
- #365 evals/ harness (outside the package, like benchmarks/): 10 golden proposer cases + scorer (structural validity, expected-chain hit rate, hallucinated-tool rate, repair usage) + run_evals CLI emitting results/latest.{json,md}; 20 hand-authored routing cases + routing harness.
- Opt-in .github/workflows/evals.yml (workflow_dispatch + weekly) runs the suite against a real provider via secrets; the stub model exercises the harness in normal CI (tests/test_evals_harness.py).
- Docs/governance: AGENTS.md repo map, CHANGELOG, public-API snapshot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
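To make "overall + per-tool accuracy, confusion pairs" concrete, the scoring could be computed roughly as follows. This is a sketch under assumptions: `Case` stands in for RoutingCase, `evaluate` for evaluate_routing, and the returned dictionary layout is invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:  # hypothetical stand-in for RoutingCase
    query: str
    expected_tool: str


def evaluate(cases: list, selector: Callable[[str], str]) -> dict:
    """Score a selector: overall accuracy, per-tool accuracy, confusion pairs."""
    totals = Counter()      # expected tool -> number of cases
    hits = Counter()        # expected tool -> number selected correctly
    confusions = Counter()  # (expected, chosen) -> count of wrong picks
    for case in cases:
        chosen = selector(case.query)
        totals[case.expected_tool] += 1
        if chosen == case.expected_tool:
            hits[case.expected_tool] += 1
        else:
            confusions[(case.expected_tool, chosen)] += 1
    return {
        "overall": sum(hits.values()) / len(cases) if cases else 0.0,
        "per_tool": {tool: hits[tool] / n for tool, n in totals.items()},
        "confusions": dict(confusions),
    }
```

A before/after annotation then amounts to running the same cases twice, once with a selector built from the original descriptions and once with the proposed ones, and comparing the `per_tool` entries.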

The proposer eval markdown report contains ✅/❌/… written as UTF-8, but the harness test read it back with read_text() (no encoding), which uses the platform default (cp1252 on Windows) and raised UnicodeDecodeError on all five windows-latest legs. Pin the read to utf-8 to match the write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
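The failure mode is easy to reproduce in isolation. The snippet below is a standalone illustration, not the harness test itself; the report contents and the `latest.md` file name are made up for the example.

```python
import tempfile
from pathlib import Path

report = "| case | result |\n| etl | ✅ |\n| ambiguous | ❌ |\n"

# Decoding the UTF-8 bytes as cp1252 fails: the encoding of ❌ contains
# byte 0x9D, which cp1252 leaves undefined.
try:
    report.encode("utf-8").decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "latest.md"
    path.write_text(report, encoding="utf-8")
    # Pin the read to the same encoding as the write, so the result does
    # not depend on the platform's default encoding.
    round_tripped = path.read_text(encoding="utf-8")
```

Passing `encoding="utf-8"` on both sides makes the round trip independent of the runner's locale.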

Copilot

Pull request overview

Hardens the build-time offline LLM proposer subsystem (flow proposer + tool-description optimizer) by adding shared structured-output/provenance/budget primitives, optional provider adapters, and eval tooling so proposer behavior is auditable, scalable to large catalogues, and measurable.

Changes:

Adds shared proposer primitives (proposals.py) and updates compiler_llm / optimizer to support structured outputs, bounded repair, provenance, and prompt budgeting.
Introduces optional Anthropic/OpenAI adapter shims with retry/timeout, spend ceilings, and usage accounting.
Adds routing-accuracy evaluation + golden eval harnesses (with an opt-in workflow), plus comprehensive tests and documentation updates.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `tests/test_routing.py` | Unit tests for routing-case mining + routing scorer + optimizer annotation. |
| `tests/test_proposer_adversarial.py` | Adversarial/property tests for prompt catalogue rendering hardening. |
| `tests/test_proposals.py` | Tests for provenance, structured parsing/repair, budgeting, and published schemas. |
| `tests/test_llm_adapters.py` | Tests for optional provider adapters (retry, ceilings, usage, import guards). |
| `tests/test_evals_harness.py` | Tests for eval harness datasets, scoring, and stub runs. |
| `tests/fixtures/public_api.json` | Updates public API snapshot for new exports and signature changes. |
| `schemas/proposal-flows.schema.json` | Published JSON Schema for flow proposal envelopes. |
| `schemas/proposal-descriptions.schema.json` | Published JSON Schema for description proposal envelopes. |
| `pyproject.toml` | Adds `llm-*` extras, adjusts pytest pythonpath, adds mypy overrides for optional SDKs. |
| `evals/run_evals.py` | CLI entry point to run evals via stub or real provider adapters. |
| `evals/routing/cases.yaml` | Golden routing-case dataset (hand-authored). |
| `evals/routing_harness.py` | Routing harness loader + deterministic keyword selector runner. |
| `evals/harness.py` | Golden proposer harness: load cases, run proposer, score, write reports, stub model. |
| `evals/cases/translate_pipeline.yaml` | Golden proposer case: translate flow. |
| `evals/cases/search_summarize.yaml` | Golden proposer case: search→summarize flow. |
| `evals/cases/mixed_catalogue.yaml` | Golden proposer case: mixed catalogue with distractor tools. |
| `evals/cases/log_pipeline.yaml` | Golden proposer case: log processing flow. |
| `evals/cases/image_pipeline.yaml` | Golden proposer case: image processing flow. |
| `evals/cases/fetch_parse_extract.yaml` | Golden proposer case: fetch→parse→extract flow. |
| `evals/cases/etl_pipeline.yaml` | Golden proposer case: ETL flow. |
| `evals/cases/distractor_fetch_parse.yaml` | Golden proposer case: fetch→parse with distractor tool. |
| `evals/cases/data_quality.yaml` | Golden proposer case: data-quality flow. |
| `evals/cases/ambiguous_lookup.yaml` | Golden proposer case: ambiguous lookup flow. |
| `evals/__init__.py` | Marks `evals/` as an importable package with rationale. |
| `docs/reference/error-table.md` | Documents new error codes for budgeting/provider adapter failures. |
| `CHANGELOG.md` | Changelog entry describing the hardened offline proposer subsystem. |
| `chainweaver/routing.py` | Adds routing cases + evaluation + case mining from traces. |
| `chainweaver/proposals.py` | Shared structured seam, provenance, repair loop, and prompt-budget planning. |
| `chainweaver/optimizer.py` | Integrates shared primitives; adds schema export + budgeting + routing accuracy annotation. |
| `chainweaver/integrations/llm_openai.py` | Optional OpenAI/OpenAI-compatible adapter implementing the proposer seam. |
| `chainweaver/integrations/llm_anthropic.py` | Optional Anthropic adapter implementing the proposer seam. |
| `chainweaver/integrations/_llm_common.py` | Shared adapter base: retry/backoff, ceilings, usage accounting, schema prompt augmentation. |
| `chainweaver/exceptions.py` | Adds typed errors for prompt budgeting and provider adapter failures + error codes. |
| `chainweaver/compiler_llm.py` | Integrates shared primitives; adds schema export, repair, budgeting, provenance persistence. |
| `chainweaver/_offline_llm.py` | Adds JSON-first parsing and hardens tool-catalogue rendering against hostile metadata. |
| `chainweaver/__init__.py` | Exposes new public symbols on the package surface (`__all__`). |
| `AGENTS.md` | Updates repo map for new proposer/routing modules and adapter integrations. |
| `.github/workflows/evals.yml` | Adds opt-in scheduled/manual workflow to run real-provider evals. |

…on; flow vocabulary in evals

Addresses PR #462 review (Copilot):

- proposals.apply_budget: truncate/batch now raise PromptBudgetExceededError when even the smallest unit (a single capped tool) still exceeds max_tokens, so the 'budget enforced before any LLM call' contract holds for every overflow mode.
- compiler_llm / optimizer: under batch/select overflow, validate each batch's proposals against the tools actually rendered into that batch's prompt (not the full catalogue), rejecting references to unshown tools and cross-batch flows. optimizer still sources original descriptions from the full map.
- Domain vocabulary: rename the four *_pipeline eval cases to *_flow and switch the eval report column / stub strings from 'chain' to flow/sequence wording.
- Tests: realistic truncate/batch budgets, plus new raise-path and out-of-batch-rejection coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk

dgenio · 2026-06-17T21:00:05Z

Thanks — addressed the review in 5187b3a.

Budget enforcement (proposals.py): truncate and batch now raise PromptBudgetExceededError when even the smallest unit (a single capped tool) still exceeds max_tokens, so the "budget enforced before any LLM call" guarantee holds for every overflow mode, not just error. Added raise-path tests.
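A rough illustration of that guarantee for the batch strategy: the planner below packs rendered tool entries greedily and raises as soon as one entry alone cannot fit, before any model call could happen. It is a simplified sketch. `plan_batches` and `BudgetExceeded` are hypothetical names, per-entry costs are summed rather than counted on the final prompt, and prompt-template overhead is ignored. Only the chars/4 heuristic and the pluggable `token_counter` idea come from the PR text.

```python
from typing import Callable


class BudgetExceeded(ValueError):
    """Hypothetical stand-in for PromptBudgetExceededError (CW-E042)."""


def estimate_tokens(text: str) -> int:
    """chars/4 heuristic, rounded up."""
    return -(-len(text) // 4)


def plan_batches(
    entries: list,
    max_tokens: int,
    token_counter: Callable[[str], int] = estimate_tokens,
) -> list:
    """Greedily pack rendered tool entries into batches that fit max_tokens."""
    batches = []
    current = []
    used = 0
    for entry in entries:
        cost = token_counter(entry)
        if cost > max_tokens:
            # The smallest unit already overflows: fail before any LLM call.
            raise BudgetExceeded(
                f"single entry needs {cost} tokens, budget is {max_tokens}"
            )
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(entry)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Raising inside the planner, rather than after the first provider call, is what lets the contract hold for every overflow mode.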

Per-batch validation (compiler_llm.py, optimizer.py): under batch/select, proposals are now validated against the tools actually rendered into that batch's prompt rather than the full catalogue, so a completion can't reference an unshown tool (or build a cross-batch flow). The optimizer still sources each shown tool's original description from the full map. Added an out-of-batch-rejection test.
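In sketch form, the per-batch check reduces to validating each proposal's tool references against the set shown in that batch's prompt. The proposal shape used here (a dict with `name` and `steps`) and the function name are assumptions for illustration; the real validation in compiler_llm.py / optimizer.py may differ.

```python
def validate_against_batch(proposals: list, shown_tools: set):
    """Split proposals into accepted ones and human-readable rejection reasons."""
    accepted, rejected = [], []
    for proposal in proposals:
        # Any step naming a tool that was not rendered into this batch's
        # prompt is treated as a hallucinated or cross-batch reference.
        unknown = [step for step in proposal["steps"] if step not in shown_tools]
        if unknown:
            rejected.append(
                f"{proposal['name']}: references tools not in this batch: {unknown}"
            )
        else:
            accepted.append(proposal)
    return accepted, rejected
```

For example, with `shown_tools = {"fetch", "parse"}`, a proposal whose steps are `["fetch", "summarize"]` is rejected even if `summarize` exists elsewhere in the full catalogue.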

Vocabulary: renamed the four *_pipeline eval cases to *_flow, and switched the report column header + stub-proposal strings from "chain" to flow/sequence wording.

One deliberate exception: I kept the internal field names expected_chains / expected_chain_hit_rate. They mirror the expected_chains key in the #365 case-spec, and "tool chains" is the existing term in this subsystem's prompt template ('Propose deterministic tool chains ("flows")' in compiler_llm.py). Only the report-visible strings were changed. Happy to rename the fields too if you'd prefer full consistency.

All four checks pass locally (ruff, format, mypy, pytest — 1782 passed, 93% coverage). The earlier Windows failures were a separate UTF-8 read fix (afea512).

Generated by Claude Code

Add test_budget_batch_raises_when_single_tool_overflows asserting PromptBudgetExceededError is raised before any LLM call when a single tool's rendered prompt exceeds max_tokens under overflow="batch" (the raise at proposals.py was previously untested; truncate had a twin test). Also add the missing 'from __future__ import annotations' to evals/__init__.py for consistency with the other eval modules.

claude added 3 commits June 17, 2026 19:26

Copilot AI review requested due to automatic review settings June 17, 2026 20:46

Copilot started reviewing on behalf of dgenio June 17, 2026 20:46

Copilot AI reviewed Jun 17, 2026

claude added 2 commits June 18, 2026 08:05

Merge remote-tracking branch 'origin/main' into claude/issue-triage-g…

3b86ed3

…rouping-9pbto2 # Conflicts: # chainweaver/__init__.py # tests/fixtures/public_api.json

dgenio merged commit a3c9cc3 into main Jun 18, 2026
20 checks passed

dgenio deleted the claude/issue-triage-grouping-9pbto2 branch June 18, 2026 08:23

dgenio mentioned this pull request Jun 18, 2026

Document and enforce PR scope + branch-naming conventions to cut multi-issue PRs, rebases, and post-review rework #463

Open

dgenio merged 7 commits into main from claude/issue-triage-grouping-9pbto2
