
feat: harden the offline LLM proposer subsystem (#363, #364, #365, #366, #367, #368, #374)#462

Merged
dgenio merged 7 commits into main from claude/issue-triage-grouping-9pbto2
Jun 18, 2026

dgenio commented Jun 17, 2026

Owner

Summary

Makes the offline LLM proposers (llm_propose_flows, optimize_tool_descriptions) production-grade as one coherent change over a single subsystem (compiler_llm, optimizer, _offline_llm). Implements the seven issues in the recommended "harden the offline LLM proposer" group, built bottom-up in three commits so each layer is independently green. Everything is opt-in and back-compatible with the bare LLMFn seam, and the base package still imports no provider SDK.

Changes

Testing

  • Linting passes (ruff check chainweaver/ tests/ examples/) — All checks passed
  • Formatting check passes (ruff format --check chainweaver/ tests/ examples/) — 244 files already formatted
  • Type checking passes (python -m mypy chainweaver/ tests/) — no issues in 207 source files
  • All existing tests pass (python -m pytest tests/) — 1780 passed, 1 skipped, total coverage 93.08% (gate 80%); new modules: proposals.py 100%, routing.py 97%, optimizer.py 98%
  • New tests added for new functionality — test_proposals.py, test_proposer_adversarial.py, test_routing.py, test_llm_adapters.py, test_evals_harness.py

All LLM calls in tests use scripted/fake stubs — no provider SDK is imported and no network call is made.
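
A scripted stub of this kind can be very small. The sketch below assumes the bare LLMFn seam is a plain "prompt in, completion out" callable; `scripted_llm` and its `prompts_seen` attribute are hypothetical names for illustration, not the helpers used in this PR's test suite.

```python
from typing import Callable, List

# Assumption: the LLMFn seam is a callable taking a prompt and returning text.
LLMFn = Callable[[str], str]


def scripted_llm(responses: List[str]) -> LLMFn:
    """Return a fake LLM that replays canned completions in order."""
    remaining = list(responses)
    prompts_seen: List[str] = []

    def fake(prompt: str) -> str:
        # Record the prompt so a test can assert on what was sent.
        prompts_seen.append(prompt)
        if not remaining:
            raise AssertionError("scripted_llm ran out of canned responses")
        return remaining.pop(0)

    # Expose the recorded prompts on the callable (hypothetical convenience).
    fake.prompts_seen = prompts_seen  # type: ignore[attr-defined]
    return fake
```

Because the stub is an ordinary callable, it needs no provider SDK and makes no network call, and a test fails loudly if the code under test calls the model more often than scripted.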

Related Issues

Closes #363
Closes #364
Closes #365
Closes #366
Closes #367
Closes #368
Closes #374

Checklist

  • Code follows project conventions (see AGENTS.md and docs/agent-context/)
  • Public API changes are documented (new symbols in __all__ + snapshot regenerated; CHANGELOG + AGENTS.md repo map + error-table updated)
  • No secrets or credentials included

Tradeoffs / risks

  • Two parse paths (JSON + YAML) are maintained; shared envelope coercion limits duplication.
  • Provider SDK churn is a maintenance cost; adapters are kept thin (one call site each) and pinned to floor versions. Cost estimates depend on the maintained price table (see #351, "Implement provider price refresh and staleness checks for the cost table").
  • The batch overflow strategy can't propose a flow spanning tools in different batches — documented; prefer select for relevance-aware workloads.
  • Real-model eval quality/cost is non-deterministic, so that lane is manual/scheduled only and never a merge gate.

🤖 Generated with Claude Code

https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk


Generated by Claude Code

claude added 3 commits June 17, 2026 19:26
…t, adversarial hardening

Hardens the offline LLM proposer subsystem (compiler_llm, optimizer):

- #363 structured-output contracts + bounded repair: StructuredLLMFn seam,
  JSON-first parsing with YAML fallback, one-shot repair loop carrying the
  validation error; published envelope JSON schemas under schemas/.
- #364 prompt versioning + provenance: PROMPT_NAME/PROMPT_VERSION pinned by a
  template-hash guard test; ProposalProvenance (prompt, model, params, repair
  usage, catalogue stats) attached to every proposal and persisted by
  write_proposals (read_provenance round-trips it).
- #366 adversarial metadata hardening: render_tool_catalogue flattens control
  chars / Unicode separators and caps description length so hostile tool
  metadata cannot break the one-entry-per-tool structure (fixture + Hypothesis
  property coverage).
- #367 token budgeting: PromptBudget with error/truncate/batch/select overflow
  strategies, chars/4 estimator + pluggable token_counter, catalogue stats in
  provenance.
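
To illustrate the #366 item above, here is a minimal sketch of flattening hostile metadata so each tool stays on one catalogue line. The function names, the 200-character cap, and the exact set of Unicode categories are assumptions for illustration; the real render_tool_catalogue may differ.

```python
import unicodedata
from typing import Mapping

MAX_DESCRIPTION_CHARS = 200  # hypothetical cap, not the PR's value

# Control/format characters and line/paragraph separators can start a new
# line or hide content, so they are replaced with a plain space.
_UNSAFE_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def flatten_metadata(text: str, max_chars: int = MAX_DESCRIPTION_CHARS) -> str:
    """Collapse text to a single line and cap its length."""
    replaced = "".join(
        " " if unicodedata.category(ch) in _UNSAFE_CATEGORIES else ch
        for ch in text
    )
    flat = " ".join(replaced.split())  # collapse runs of whitespace
    if len(flat) > max_chars:
        flat = flat[: max_chars - 3].rstrip() + "..."
    return flat


def render_catalogue(tools: Mapping[str, str]) -> str:
    """Render exactly one '- name: description' line per tool."""
    return "\n".join(
        f"- {flatten_metadata(name)}: {flatten_metadata(desc)}"
        for name, desc in tools.items()
    )
```

With this shape, a description containing an embedded newline followed by `- evil_tool: ...` is rendered inline on the owning tool's line instead of appearing as a separate catalogue entry.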

New error codes CW-E042 (PromptBudgetExceededError), CW-E043 (LLMProviderError),
CW-E044 (LLMBudgetExceededError) registered and documented.
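
The bounded repair loop from the #363 item above can be sketched roughly as follows. This is an illustration under assumptions: the envelope shape (`{"flows": [...]}`), the validator, and the function names are hypothetical, and the YAML fallback is omitted to keep the example dependency-free.

```python
import json
from typing import Any, Callable, Dict, Tuple

LLMFn = Callable[[str], str]  # assumption: prompt in, completion out


def validate_envelope(payload: Any) -> Dict[str, Any]:
    """Hypothetical validator: require an object with a 'flows' list."""
    if not isinstance(payload, dict) or not isinstance(payload.get("flows"), list):
        raise ValueError('expected an object with a "flows" list')
    return payload


def propose_with_repair(
    llm: LLMFn, prompt: str, max_repairs: int = 1
) -> Tuple[Dict[str, Any], int]:
    """Return (envelope, repairs_used); retry at most max_repairs times."""
    attempt_prompt = prompt
    for attempt in range(max_repairs + 1):
        raw = llm(attempt_prompt)
        try:
            # json.JSONDecodeError subclasses ValueError, so one handler
            # covers both parse failures and validation failures.
            return validate_envelope(json.loads(raw)), attempt
        except ValueError as exc:
            if attempt == max_repairs:
                raise
            # Carry the validation error back to the model, as described above.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was invalid: {exc}\n"
                "Reply again with valid JSON only."
            )
    raise ValueError("max_repairs must be >= 0")
```

Returning the number of repairs used makes it straightforward to record repair usage alongside each proposal.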

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
…#368)

Ship thin, optional-extra adapters that produce LLMFn/StructuredLLMFn callables
for the offline proposers, keeping the base package free of provider SDKs:

- chainweaver.integrations._llm_common: ProviderAdapter base with tenacity
  retry/backoff+jitter, per-call timeout, max_calls/max_cost_usd ceilings
  (priced from the maintained cost table), and a live LLMUsage tally.
- llm_anthropic.anthropic_llm_fn / llm_openai.openai_llm_fn (the latter also
  covers OpenAI-compatible/local endpoints via base_url). Lazy SDK imports with
  the established 'install the extra' ImportError; structured-output via the
  json_schema seam.
- New extras chainweaver[llm-anthropic], chainweaver[llm-openai]; mypy override
  so the absent SDKs don't fail the type check.

Tests use fake clients (no SDK, no network); backoff sleep is patched out.
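
As a rough illustration of the retry and ceiling behaviour described above, the sketch below uses only the standard library rather than tenacity, and takes an injectable `sleep` so tests can patch the backoff out. All names here are hypothetical; `CallCeilingExceeded` merely stands in for the PR's LLMBudgetExceededError, and counting every attempt against `max_calls` is an assumption.

```python
import random
import time
from typing import Callable

LLMFn = Callable[[str], str]  # assumption: prompt in, completion out


class CallCeilingExceeded(RuntimeError):
    """Hypothetical stand-in for the PR's LLMBudgetExceededError."""


def with_retry_and_ceiling(
    call: LLMFn,
    *,
    max_calls: int,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> LLMFn:
    """Wrap call with exponential backoff + jitter and a hard call ceiling."""
    calls_made = 0

    def wrapped(prompt: str) -> str:
        nonlocal calls_made
        for attempt in range(1, max_attempts + 1):
            # Assumption: every provider request, including retries, counts.
            if calls_made >= max_calls:
                raise CallCeilingExceeded(f"max_calls={max_calls} reached")
            calls_made += 1
            try:
                return call(prompt)
            except ConnectionError:
                if attempt == max_attempts:
                    raise
                delay = base_delay * 2 ** (attempt - 1)
                sleep(delay + random.uniform(0, delay))  # backoff plus jitter
        raise AssertionError("max_attempts must be >= 1")

    return wrapped
```

In a test, passing `sleep=recorded.append` (or a no-op) keeps the suite fast while still letting assertions check how many backoff waits happened.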

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
…esses (#365, #374)

- #374 chainweaver/routing.py: RoutingCase, evaluate_routing (overall + per-tool
  accuracy, confusion pairs), and mine_routing_cases from agent traces.
  optimize_tool_descriptions(..., eval_cases=, routing_selector=) now annotates
  each proposal with before/after per-tool selection accuracy.
- #365 evals/ harness (outside the package, like benchmarks/): 10 golden
  proposer cases + scorer (structural validity, expected-chain hit rate,
  hallucinated-tool rate, repair usage) + run_evals CLI emitting results/latest.{json,md};
  20 hand-authored routing cases + routing harness.
- Opt-in .github/workflows/evals.yml (workflow_dispatch + weekly) runs the suite
  against a real provider via secrets; the stub model exercises the harness in
  normal CI (tests/test_evals_harness.py).
- Docs/governance: AGENTS.md repo map, CHANGELOG, public-API snapshot.
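
The routing metrics named in the #374 item above (overall accuracy, per-tool accuracy, confusion pairs) can be computed along these lines. This is a sketch under assumptions: `Case` and `score_routing` are hypothetical names and shapes, not the PR's RoutingCase or evaluate_routing.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Case:  # hypothetical shape, not the PR's RoutingCase
    query: str
    expected_tool: str


def score_routing(
    cases: Sequence[Case], selector: Callable[[str], str]
) -> Tuple[float, Dict[str, float], Dict[Tuple[str, str], int]]:
    """Return (overall accuracy, per-tool accuracy, confusion pair counts)."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    confusions: Counter = Counter()
    for case in cases:
        chosen = selector(case.query)
        totals[case.expected_tool] += 1
        if chosen == case.expected_tool:
            hits[case.expected_tool] += 1
        else:
            # Key is (expected, chosen) so systematic mix-ups stand out.
            confusions[(case.expected_tool, chosen)] += 1
    overall = sum(hits.values()) / len(cases) if cases else 0.0
    per_tool = {tool: hits[tool] / n for tool, n in totals.items()}
    return overall, per_tool, dict(confusions)
```

Running the same cases with the original and the proposed descriptions, and comparing the per-tool numbers, gives a before/after view of the kind the optimizer annotation describes.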

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
Copilot AI review requested due to automatic review settings June 17, 2026 20:46
The proposer eval markdown report contains ✅/❌/… written as UTF-8, but the
harness test read it back with read_text() (no encoding), which uses the
platform default — cp1252 on Windows — and raised UnicodeDecodeError on all
five windows-latest legs. Pin the read to utf-8 to match the write.
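
A minimal sketch of the fix pattern, with hypothetical helper names: pin the encoding on both the write and the read so the result does not depend on the platform default.

```python
from pathlib import Path


def write_report(path: Path, body: str) -> None:
    # Write as UTF-8 explicitly so non-ASCII status marks survive.
    path.write_text(body, encoding="utf-8")


def read_report(path: Path) -> str:
    # Without encoding=, read_text() may use the locale's preferred
    # encoding (for example cp1252 on Windows) and fail on these bytes.
    return path.read_text(encoding="utf-8")
```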

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk

Copilot AI left a comment

Contributor


Pull request overview

Hardens the build-time offline LLM proposer subsystem (flow proposer + tool-description optimizer) by adding shared structured-output/provenance/budget primitives, optional provider adapters, and eval tooling so proposer behavior is auditable, scalable to large catalogues, and measurable.

Changes:

  • Adds shared proposer primitives (proposals.py) and updates compiler_llm / optimizer to support structured outputs, bounded repair, provenance, and prompt budgeting.
  • Introduces optional Anthropic/OpenAI adapter shims with retry/timeout, spend ceilings, and usage accounting.
  • Adds routing-accuracy evaluation + golden eval harnesses (with an opt-in workflow), plus comprehensive tests and documentation updates.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `tests/test_routing.py` | Unit tests for routing-case mining + routing scorer + optimizer annotation. |
| `tests/test_proposer_adversarial.py` | Adversarial/property tests for prompt catalogue rendering hardening. |
| `tests/test_proposals.py` | Tests for provenance, structured parsing/repair, budgeting, and published schemas. |
| `tests/test_llm_adapters.py` | Tests for optional provider adapters (retry, ceilings, usage, import guards). |
| `tests/test_evals_harness.py` | Tests for eval harness datasets, scoring, and stub runs. |
| `tests/fixtures/public_api.json` | Updates public API snapshot for new exports and signature changes. |
| `schemas/proposal-flows.schema.json` | Published JSON Schema for flow proposal envelopes. |
| `schemas/proposal-descriptions.schema.json` | Published JSON Schema for description proposal envelopes. |
| `pyproject.toml` | Adds llm-* extras, adjusts pytest pythonpath, adds mypy overrides for optional SDKs. |
| `evals/run_evals.py` | CLI entry point to run evals via stub or real provider adapters. |
| `evals/routing/cases.yaml` | Golden routing-case dataset (hand-authored). |
| `evals/routing_harness.py` | Routing harness loader + deterministic keyword selector runner. |
| `evals/harness.py` | Golden proposer harness: load cases, run proposer, score, write reports, stub model. |
| `evals/cases/translate_pipeline.yaml` | Golden proposer case: translate flow. |
| `evals/cases/search_summarize.yaml` | Golden proposer case: search→summarize flow. |
| `evals/cases/mixed_catalogue.yaml` | Golden proposer case: mixed catalogue with distractor tools. |
| `evals/cases/log_pipeline.yaml` | Golden proposer case: log processing flow. |
| `evals/cases/image_pipeline.yaml` | Golden proposer case: image processing flow. |
| `evals/cases/fetch_parse_extract.yaml` | Golden proposer case: fetch→parse→extract flow. |
| `evals/cases/etl_pipeline.yaml` | Golden proposer case: ETL flow. |
| `evals/cases/distractor_fetch_parse.yaml` | Golden proposer case: fetch→parse with distractor tool. |
| `evals/cases/data_quality.yaml` | Golden proposer case: data-quality flow. |
| `evals/cases/ambiguous_lookup.yaml` | Golden proposer case: ambiguous lookup flow. |
| `evals/__init__.py` | Marks evals/ as an importable package with rationale. |
| `docs/reference/error-table.md` | Documents new error codes for budgeting/provider adapter failures. |
| `CHANGELOG.md` | Changelog entry describing the hardened offline proposer subsystem. |
| `chainweaver/routing.py` | Adds routing cases + evaluation + case mining from traces. |
| `chainweaver/proposals.py` | Shared structured seam, provenance, repair loop, and prompt-budget planning. |
| `chainweaver/optimizer.py` | Integrates shared primitives; adds schema export + budgeting + routing accuracy annotation. |
| `chainweaver/integrations/llm_openai.py` | Optional OpenAI/OpenAI-compatible adapter implementing the proposer seam. |
| `chainweaver/integrations/llm_anthropic.py` | Optional Anthropic adapter implementing the proposer seam. |
| `chainweaver/integrations/_llm_common.py` | Shared adapter base: retry/backoff, ceilings, usage accounting, schema prompt augmentation. |
| `chainweaver/exceptions.py` | Adds typed errors for prompt budgeting and provider adapter failures + error codes. |
| `chainweaver/compiler_llm.py` | Integrates shared primitives; adds schema export, repair, budgeting, provenance persistence. |
| `chainweaver/_offline_llm.py` | Adds JSON-first parsing and hardens tool-catalogue rendering against hostile metadata. |
| `chainweaver/__init__.py` | Exposes new public symbols on the package surface (`__all__`). |
| `AGENTS.md` | Updates repo map for new proposer/routing modules and adapter integrations. |
| `.github/workflows/evals.yml` | Adds opt-in scheduled/manual workflow to run real-provider evals. |

Comment thread chainweaver/proposals.py
Comment thread chainweaver/proposals.py
Comment thread chainweaver/compiler_llm.py Outdated
Comment thread chainweaver/optimizer.py Outdated
Comment thread evals/cases/etl_pipeline.yaml Outdated
Comment thread evals/cases/log_pipeline.yaml Outdated
Comment thread evals/cases/translate_pipeline.yaml Outdated
Comment thread evals/harness.py Outdated
Comment thread evals/harness.py
Comment thread evals/harness.py
…on; flow vocabulary in evals

Addresses PR #462 review (Copilot):

- proposals.apply_budget: truncate/batch now raise PromptBudgetExceededError when
  even the smallest unit (a single capped tool) still exceeds max_tokens, so the
  'budget enforced before any LLM call' contract holds for every overflow mode.
- compiler_llm / optimizer: under batch/select overflow, validate each batch's
  proposals against the tools actually rendered into that batch's prompt (not the
  full catalogue), rejecting references to unshown tools and cross-batch flows.
  optimizer still sources original descriptions from the full map.
- Domain vocabulary: rename the four *_pipeline eval cases to *_flow and switch
  the eval report column / stub strings from 'chain' to flow/sequence wording.
- Tests: realistic truncate/batch budgets, plus new raise-path and
  out-of-batch-rejection coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk

dgenio commented Jun 17, 2026

Owner Author

Thanks — addressed the review in 5187b3a.

Budget enforcement (proposals.py): truncate and batch now raise PromptBudgetExceededError when even the smallest unit (a single capped tool) still exceeds max_tokens, so the "budget enforced before any LLM call" guarantee holds for every overflow mode, not just error. Added raise-path tests.
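
A minimal sketch of that guarantee for the batch strategy, assuming the chars/4 estimator mentioned earlier in the PR (rounded up here, which is an assumption). `BudgetExceeded`, `estimate_tokens`, and `batch_entries` are hypothetical names, and the sketch ignores prompt-template overhead and line joiners.

```python
from typing import Callable, List, Sequence


class BudgetExceeded(ValueError):
    """Hypothetical stand-in for PromptBudgetExceededError."""


def estimate_tokens(text: str) -> int:
    # chars/4 heuristic; rounding up is an assumption of this sketch.
    return (len(text) + 3) // 4


def batch_entries(
    entries: Sequence[str],
    max_tokens: int,
    token_counter: Callable[[str], int] = estimate_tokens,
) -> List[List[str]]:
    """Greedily pack rendered tool entries into batches under max_tokens."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for entry in entries:
        cost = token_counter(entry)
        if cost > max_tokens:
            # Even a batch of one would overflow: fail before any LLM call.
            raise BudgetExceeded(
                f"single entry needs {cost} tokens, budget is {max_tokens}"
            )
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(entry)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Since planning happens entirely before the model is invoked, an oversized single tool surfaces as an error rather than as a silently over-budget prompt.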

Per-batch validation (compiler_llm.py, optimizer.py): under batch/select, proposals are now validated against the tools actually rendered into that batch's prompt rather than the full catalogue, so a completion can't reference an unshown tool (or build a cross-batch flow). The optimizer still sources each shown tool's original description from the full map. Added an out-of-batch-rejection test.
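
The per-batch check can be pictured as follows. This is a sketch under assumptions: `Flow` and `validate_against_batch` are hypothetical names, and a flow is reduced to an ordered tuple of tool names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Flow:  # hypothetical shape: a named, ordered sequence of tool names
    name: str
    steps: Tuple[str, ...]


def validate_against_batch(
    flows: Iterable[Flow], shown_tools: Set[str]
) -> Tuple[List[Flow], List[Tuple[Flow, List[str]]]]:
    """Split flows into accepted and rejected (with the unshown tools)."""
    accepted: List[Flow] = []
    rejected: List[Tuple[Flow, List[str]]] = []
    for flow in flows:
        # Validate against the tools rendered into THIS batch's prompt,
        # not the full catalogue.
        unshown = sorted({tool for tool in flow.steps if tool not in shown_tools})
        if unshown:
            rejected.append((flow, unshown))
        else:
            accepted.append(flow)
    return accepted, rejected
```

A flow that mixes a tool from this batch with one from another batch is rejected here, which is the cross-batch case the comment above refers to.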

Vocabulary: renamed the four *_pipeline eval cases to *_flow, and switched the report column header + stub-proposal strings from "chain" to flow/sequence wording.

One deliberate exception: I kept the internal field names expected_chains / expected_chain_hit_rate. They mirror the expected_chains key in the #365 case-spec, and "tool chains" is the existing term in this subsystem's prompt template ('Propose deterministic tool chains ("flows")' in compiler_llm.py). Only the report-visible strings were changed. Happy to rename the fields too if you'd prefer full consistency.

All four checks pass locally (ruff, format, mypy, pytest — 1782 passed, 93% coverage). The earlier Windows failures were a separate UTF-8 read fix (afea512).



claude added 2 commits June 18, 2026 08:05
…rouping-9pbto2

# Conflicts:
#	chainweaver/__init__.py
#	tests/fixtures/public_api.json
Add test_budget_batch_raises_when_single_tool_overflows asserting
PromptBudgetExceededError is raised before any LLM call when a single
tool's rendered prompt exceeds max_tokens under overflow="batch" (the
raise at proposals.py was previously untested; truncate had a twin test).
Also add the missing 'from __future__ import annotations' to
evals/__init__.py for consistency with the other eval modules.
dgenio merged commit a3c9cc3 into main on Jun 18, 2026
20 checks passed
dgenio deleted the claude/issue-triage-grouping-9pbto2 branch on Jun 18, 2026 08:23