feat: harden the offline LLM proposer subsystem (#363, #364, #365, #366, #367, #368, #374) #462
…t, adversarial hardening

Hardens the offline LLM proposer subsystem (compiler_llm, optimizer):
- #363 structured-output contracts + bounded repair: StructuredLLMFn seam, JSON-first parsing with YAML fallback, one-shot repair loop carrying the validation error; published envelope JSON schemas under schemas/.
- #364 prompt versioning + provenance: PROMPT_NAME/PROMPT_VERSION pinned by a template-hash guard test; ProposalProvenance (prompt, model, params, repair usage, catalogue stats) attached to every proposal and persisted by write_proposals (read_provenance round-trips it).
- #366 adversarial metadata hardening: render_tool_catalogue flattens control chars / Unicode separators and caps description length so hostile tool metadata cannot break the one-entry-per-tool structure (fixture + Hypothesis property coverage).
- #367 token budgeting: PromptBudget with error/truncate/batch/select overflow strategies, chars/4 estimator + pluggable token_counter, catalogue stats in provenance.

New error codes CW-E042 (PromptBudgetExceededError), CW-E043 (LLMProviderError), CW-E044 (LLMBudgetExceededError) registered and documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
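The JSON-first parsing and bounded repair described for #363 can be illustrated roughly as follows. This is a minimal sketch, not the code in this PR: parse_envelope, validate_envelope, and propose_with_repair are hypothetical names, the validator is a stand-in for real JSON Schema validation, and the YAML fallback is modelled as an optional callable so the example needs only the standard library.

```python
from __future__ import annotations

import json
from typing import Any, Callable


def parse_envelope(text: str, fallback: Callable[[str], Any] | None = None) -> Any:
    """Parse strict JSON first; use `fallback` (e.g. yaml.safe_load) only if JSON fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if fallback is None:
            raise
        return fallback(text)


def validate_envelope(envelope: Any) -> None:
    """Hypothetical stand-in for schema validation: require a 'proposals' list."""
    if not isinstance(envelope, dict) or not isinstance(envelope.get("proposals"), list):
        raise ValueError("envelope must be an object with a 'proposals' list")


def propose_with_repair(
    llm: Callable[[str], str], prompt: str, max_repair_attempts: int = 1
) -> tuple[dict, int]:
    """Return (envelope, repairs_used); re-prompt with the error at most N times."""
    current = prompt
    for attempt in range(max_repair_attempts + 1):
        raw = llm(current)
        try:
            envelope = parse_envelope(raw)
            validate_envelope(envelope)
            return envelope, attempt
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            if attempt == max_repair_attempts:
                raise
            # Carry the validation error into the single repair prompt.
            current = f"{prompt}\n\nPrevious reply was invalid: {exc}\nReply with valid JSON only."
    raise AssertionError("unreachable")
```

Returning the number of repairs used makes it straightforward to record repair usage alongside each proposal, which is the kind of field the provenance record above is said to carry.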
…#368)

Ship thin, optional-extra adapters that produce LLMFn/StructuredLLMFn callables for the offline proposers, keeping the base package free of provider SDKs:
- chainweaver.integrations._llm_common: ProviderAdapter base with tenacity retry/backoff+jitter, per-call timeout, max_calls/max_cost_usd ceilings (priced from the maintained cost table), and a live LLMUsage tally.
- llm_anthropic.anthropic_llm_fn / llm_openai.openai_llm_fn (the latter also covers OpenAI-compatible/local endpoints via base_url). Lazy SDK imports with the established 'install the extra' ImportError; structured-output via the json_schema seam.
- New extras chainweaver[llm-anthropic], chainweaver[llm-openai]; mypy override so the absent SDKs don't fail the type check.

Tests use fake clients (no SDK, no network); backoff sleep is patched out.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
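For readers unfamiliar with the pattern, the retry/backoff plus call-ceiling behaviour can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the adapter code: the commit uses tenacity, whereas this sketch hand-rolls the loop, and RetryingCaller, CallCeilingExceeded, and the choice of ConnectionError as the retryable error are all hypothetical.

```python
from __future__ import annotations

import random
import time
from typing import Callable


class CallCeilingExceeded(RuntimeError):
    """Hypothetical error raised when the max_calls ceiling is hit."""


class RetryingCaller:
    """Wrap an LLM callable with exponential backoff, jitter, and a call ceiling."""

    def __init__(
        self,
        fn: Callable[[str], str],
        *,
        max_calls: int = 10,
        max_retries: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.fn = fn
        self.max_calls = max_calls
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests can patch the backoff out
        self.calls = 0  # live tally of provider calls, including retries

    def __call__(self, prompt: str) -> str:
        for attempt in range(self.max_retries + 1):
            if self.calls >= self.max_calls:
                raise CallCeilingExceeded(f"max_calls={self.max_calls} reached")
            self.calls += 1
            try:
                return self.fn(prompt)
            except ConnectionError:
                if attempt == self.max_retries:
                    raise
                delay = self.base_delay * (2**attempt)
                # Jitter: sleep between delay and 2 * delay.
                self.sleep(delay + random.uniform(0, delay))
        raise AssertionError("unreachable")
```

Counting retries against the ceiling is a design choice in this sketch: it bounds total provider traffic, not just the number of logical requests.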
…esses (#365, #374)

- #374 chainweaver/routing.py: RoutingCase, evaluate_routing (overall + per-tool accuracy, confusion pairs), and mine_routing_cases from agent traces. optimize_tool_descriptions(..., eval_cases=, routing_selector=) now annotates each proposal with before/after per-tool selection accuracy.
- #365 evals/ harness (outside the package, like benchmarks/): 10 golden proposer cases + scorer (structural validity, expected-chain hit rate, hallucinated-tool rate, repair usage) + run_evals CLI emitting results/latest.{json,md}; 20 hand-authored routing cases + routing harness.
- Opt-in .github/workflows/evals.yml (workflow_dispatch + weekly) runs the suite against a real provider via secrets; the stub model exercises the harness in normal CI (tests/test_evals_harness.py).
- Docs/governance: AGENTS.md repo map, CHANGELOG, public-API snapshot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
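As a rough illustration of what "overall + per-tool accuracy, confusion pairs" means for a routing scorer, consider the sketch below. It does not reproduce the real RoutingCase or evaluate_routing signatures: Case, evaluate, and the shape of the returned dict are assumptions made for the example.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """Hypothetical routing case: a query and the tool that should be selected."""

    query: str
    expected_tool: str


def evaluate(cases: Sequence[Case], selector: Callable[[str], str]) -> dict:
    """Score a selector: overall accuracy, per-tool accuracy, and confusion pairs."""
    totals: Counter[str] = Counter()
    hits: Counter[str] = Counter()
    confusion: Counter[tuple[str, str]] = Counter()
    for case in cases:
        chosen = selector(case.query)
        totals[case.expected_tool] += 1
        if chosen == case.expected_tool:
            hits[case.expected_tool] += 1
        else:
            # (expected, chosen) pairs show which tools get mistaken for which.
            confusion[(case.expected_tool, chosen)] += 1
    overall = sum(hits.values()) / len(cases) if cases else 0.0
    per_tool = {tool: hits[tool] / count for tool, count in totals.items()}
    return {"overall": overall, "per_tool": per_tool, "confusion": dict(confusion)}
```

Running the same cases against a selector before and after a description change gives the before/after per-tool comparison that the optimizer annotation describes.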
The proposer eval markdown report contains ✅/❌/… written as UTF-8, but the harness test read it back with read_text() (no encoding), which uses the platform default (cp1252 on Windows) and raised UnicodeDecodeError on all five windows-latest legs. Pin the read to utf-8 to match the write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
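The fix amounts to pinning the encoding on both sides of the round trip. A minimal sketch, with roundtrip_report as a hypothetical helper and latest.md as an illustrative file name:

```python
import tempfile
from pathlib import Path


def roundtrip_report(text: str) -> str:
    """Write and read a report with the encoding pinned on both sides."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "latest.md"
        path.write_text(text, encoding="utf-8")
        # Without encoding=..., read_text() falls back to the platform default,
        # which can fail to decode these bytes on Windows.
        return path.read_text(encoding="utf-8")
```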
Pull request overview
Hardens the build-time offline LLM proposer subsystem (flow proposer + tool-description optimizer) by adding shared structured-output/provenance/budget primitives, optional provider adapters, and eval tooling so proposer behavior is auditable, scalable to large catalogues, and measurable.
Changes:
- Adds shared proposer primitives (proposals.py) and updates compiler_llm/optimizer to support structured outputs, bounded repair, provenance, and prompt budgeting.
- Introduces optional Anthropic/OpenAI adapter shims with retry/timeout, spend ceilings, and usage accounting.
- Adds routing-accuracy evaluation + golden eval harnesses (with an opt-in workflow), plus comprehensive tests and documentation updates.
Reviewed changes
Copilot reviewed 38 out of 38 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/test_routing.py | Unit tests for routing-case mining + routing scorer + optimizer annotation. |
| tests/test_proposer_adversarial.py | Adversarial/property tests for prompt catalogue rendering hardening. |
| tests/test_proposals.py | Tests for provenance, structured parsing/repair, budgeting, and published schemas. |
| tests/test_llm_adapters.py | Tests for optional provider adapters (retry, ceilings, usage, import guards). |
| tests/test_evals_harness.py | Tests for eval harness datasets, scoring, and stub runs. |
| tests/fixtures/public_api.json | Updates public API snapshot for new exports and signature changes. |
| schemas/proposal-flows.schema.json | Published JSON Schema for flow proposal envelopes. |
| schemas/proposal-descriptions.schema.json | Published JSON Schema for description proposal envelopes. |
| pyproject.toml | Adds llm-* extras, adjusts pytest pythonpath, adds mypy overrides for optional SDKs. |
| evals/run_evals.py | CLI entry point to run evals via stub or real provider adapters. |
| evals/routing/cases.yaml | Golden routing-case dataset (hand-authored). |
| evals/routing_harness.py | Routing harness loader + deterministic keyword selector runner. |
| evals/harness.py | Golden proposer harness: load cases, run proposer, score, write reports, stub model. |
| evals/cases/translate_pipeline.yaml | Golden proposer case: translate flow. |
| evals/cases/search_summarize.yaml | Golden proposer case: search→summarize flow. |
| evals/cases/mixed_catalogue.yaml | Golden proposer case: mixed catalogue with distractor tools. |
| evals/cases/log_pipeline.yaml | Golden proposer case: log processing flow. |
| evals/cases/image_pipeline.yaml | Golden proposer case: image processing flow. |
| evals/cases/fetch_parse_extract.yaml | Golden proposer case: fetch→parse→extract flow. |
| evals/cases/etl_pipeline.yaml | Golden proposer case: ETL flow. |
| evals/cases/distractor_fetch_parse.yaml | Golden proposer case: fetch→parse with distractor tool. |
| evals/cases/data_quality.yaml | Golden proposer case: data-quality flow. |
| evals/cases/ambiguous_lookup.yaml | Golden proposer case: ambiguous lookup flow. |
| evals/__init__.py | Marks evals/ as an importable package with rationale. |
| docs/reference/error-table.md | Documents new error codes for budgeting/provider adapter failures. |
| CHANGELOG.md | Changelog entry describing the hardened offline proposer subsystem. |
| chainweaver/routing.py | Adds routing cases + evaluation + case mining from traces. |
| chainweaver/proposals.py | Shared structured seam, provenance, repair loop, and prompt-budget planning. |
| chainweaver/optimizer.py | Integrates shared primitives; adds schema export + budgeting + routing accuracy annotation. |
| chainweaver/integrations/llm_openai.py | Optional OpenAI/OpenAI-compatible adapter implementing the proposer seam. |
| chainweaver/integrations/llm_anthropic.py | Optional Anthropic adapter implementing the proposer seam. |
| chainweaver/integrations/_llm_common.py | Shared adapter base: retry/backoff, ceilings, usage accounting, schema prompt augmentation. |
| chainweaver/exceptions.py | Adds typed errors for prompt budgeting and provider adapter failures + error codes. |
| chainweaver/compiler_llm.py | Integrates shared primitives; adds schema export, repair, budgeting, provenance persistence. |
| chainweaver/_offline_llm.py | Adds JSON-first parsing and hardens tool-catalogue rendering against hostile metadata. |
| chainweaver/__init__.py | Exposes new public symbols on the package surface (__all__). |
| AGENTS.md | Updates repo map for new proposer/routing modules and adapter integrations. |
| .github/workflows/evals.yml | Adds opt-in scheduled/manual workflow to run real-provider evals. |
…on; flow vocabulary in evals

Addresses PR #462 review (Copilot):
- proposals.apply_budget: truncate/batch now raise PromptBudgetExceededError when even the smallest unit (a single capped tool) still exceeds max_tokens, so the 'budget enforced before any LLM call' contract holds for every overflow mode.
- compiler_llm / optimizer: under batch/select overflow, validate each batch's proposals against the tools actually rendered into that batch's prompt (not the full catalogue), rejecting references to unshown tools and cross-batch flows. optimizer still sources original descriptions from the full map.
- Domain vocabulary: rename the four *_pipeline eval cases to *_flow and switch the eval report column / stub strings from 'chain' to flow/sequence wording.
- Tests: realistic truncate/batch budgets, plus new raise-path and out-of-batch-rejection coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk
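The two review fixes (raise when a single unit cannot fit, and validate against only the tools shown in a batch) can be sketched as below. This is an illustrative approximation, not proposals.apply_budget itself: plan_batches, unshown_tools, and BudgetExceeded are hypothetical names, the estimator rounds chars/4 up (an assumption), and the sketch ignores the fixed prompt-template overhead that a real budget would also have to count.

```python
from __future__ import annotations


class BudgetExceeded(ValueError):
    """Hypothetical stand-in for PromptBudgetExceededError."""


def estimate_tokens(text: str) -> int:
    """chars/4 estimate, rounded up (the rounding direction is an assumption)."""
    return (len(text) + 3) // 4


def plan_batches(entries: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily pack rendered tool entries into batches that fit max_tokens."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for entry in entries:
        cost = estimate_tokens(entry)
        if cost > max_tokens:
            # Even a batch of one would overflow: fail before any LLM call.
            raise BudgetExceeded(f"single entry needs {cost} tokens > {max_tokens}")
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(entry)
        used += cost
    if current:
        batches.append(current)
    return batches


def unshown_tools(flow_steps: list[str], shown: set[str]) -> list[str]:
    """Return steps that reference tools not rendered into this batch's prompt."""
    return [step for step in flow_steps if step not in shown]
```

A proposal whose unshown_tools result is non-empty would be rejected for that batch, which is also why a flow cannot span tools in different batches under this strategy.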
Thanks, addressed the review in 5187b3a.

Budget enforcement (proposals.py):
Per-batch validation (compiler_llm.py, optimizer.py): under
Vocabulary: renamed the four
One deliberate exception: I kept the internal field names

All four checks pass locally (ruff, format, mypy, pytest: 1782 passed, 93% coverage). The earlier Windows failures were a separate UTF-8 read fix (afea512).

Generated by Claude Code
…rouping-9pbto2

# Conflicts:
#	chainweaver/__init__.py
#	tests/fixtures/public_api.json
Add test_budget_batch_raises_when_single_tool_overflows asserting PromptBudgetExceededError is raised before any LLM call when a single tool's rendered prompt exceeds max_tokens under overflow="batch" (the raise at proposals.py was previously untested; truncate had a twin test). Also add the missing 'from __future__ import annotations' to evals/__init__.py for consistency with the other eval modules.
Summary
Makes the offline LLM proposers (llm_propose_flows, optimize_tool_descriptions) production-grade as one coherent change over a single subsystem (compiler_llm, optimizer, _offline_llm). Implements the seven issues in the recommended "harden the offline LLM proposer" group, built bottom-up in three commits so each layer is independently green. Everything is opt-in and back-compatible with the bare LLMFn seam, and the base package still imports no provider SDK.

Changes
- chainweaver/proposals.py: ModelInfo / ProposalProvenance (prompt name/version/SHA-256, caller-asserted model, params, repair usage, catalogue stats) attached to every proposal; write_proposals persists it and read_provenance reads it back. Templates pinned by PROMPT_VERSION + a hash-guard test.
- StructuredLLMFn protocol receives a published JSON-Schema envelope (schemas/proposal-{flows,descriptions}.schema.json); JSON-first parsing with YAML fallback; one-shot repair (max_repair_attempts, default 1) carrying the validation error.
- render_tool_catalogue flattens control chars / Unicode separators and caps description length so hostile tool metadata can't break the one-entry-per-tool prompt structure (fixture + Hypothesis property coverage).
- PromptBudget with error/truncate/batch/select overflow strategies, chars/4 estimator + pluggable token_counter, typed PromptBudgetExceededError (CW-E042) raised before any LLM call.
- chainweaver/integrations/llm_anthropic.py + llm_openai.py (and shared _llm_common.py) producing LLMFn/StructuredLLMFn with retry/backoff, timeout, max_calls/max_cost_usd ceilings (LLMProviderError CW-E043, LLMBudgetExceededError CW-E044), and a live .usage tally. New extras chainweaver[llm-anthropic] / [llm-openai] (the OpenAI adapter also covers OpenAI-compatible/local endpoints via base_url).
- chainweaver/routing.py: RoutingCase, evaluate_routing (overall + per-tool accuracy, confusion pairs), mine_routing_cases; optimize_tool_descriptions(..., eval_cases=, routing_selector=) annotates each proposal with before/after per-tool selection accuracy.
- evals/ tree (10 golden proposer cases + 20 routing cases, scorer, runner emitting results/latest.{json,md}) plus an opt-in .github/workflows/evals.yml (manual + weekly) that runs against a real provider via repo secrets; a deterministic stub exercises the harness in normal CI.
- AGENTS.md repo map, CHANGELOG.md, docs/reference/error-table.md, and the public-API snapshot updated.

Testing
- ruff check chainweaver/ tests/ examples/: All checks passed
- ruff format --check chainweaver/ tests/ examples/: 244 files already formatted
- python -m mypy chainweaver/ tests/: no issues in 207 source files
- python -m pytest tests/: 1780 passed, 1 skipped, total coverage 93.08% (gate 80%); new modules: proposals.py 100%, routing.py 97%, optimizer.py 98%
- test_proposals.py, test_proposer_adversarial.py, test_routing.py, test_llm_adapters.py, test_evals_harness.py

All LLM calls in tests use scripted/fake stubs; no provider SDK is imported and no network call is made.
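The catalogue hardening listed under Changes (flatten control characters and Unicode separators, cap description length) can be approximated as follows. This is a sketch under assumptions, not render_tool_catalogue: flatten, render_entry, the 200-character default cap, the "- name: description" entry format, and the exact set of Unicode categories treated as hostile are all illustrative choices.

```python
import unicodedata

# Hypothetical cap; the real limit used by render_tool_catalogue is not stated here.
MAX_DESCRIPTION_CHARS = 200

# Control chars, format chars, and line/paragraph separators.
_FLATTENED_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def flatten(text: str) -> str:
    """Replace control/separator characters with spaces and collapse whitespace."""
    cleaned = "".join(
        " " if unicodedata.category(ch) in _FLATTENED_CATEGORIES else ch for ch in text
    )
    return " ".join(cleaned.split())


def render_entry(name: str, description: str, limit: int = MAX_DESCRIPTION_CHARS) -> str:
    """Render one tool as exactly one line, whatever the metadata contains."""
    desc = flatten(description)
    if len(desc) > limit:
        desc = desc[: limit - 1].rstrip() + "…"
    return f"- {flatten(name)}: {desc}"
```

Because every entry renders to a single line, a description that embeds a newline followed by a fake list item stays inside its own entry instead of appearing as an extra tool.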
Related Issues
Closes #363
Closes #364
Closes #365
Closes #366
Closes #367
Closes #368
Closes #374
Checklist
AGENTS.md and docs/agent-context/)
__all__ + snapshot regenerated; CHANGELOG + AGENTS.md repo map + error-table updated)

Tradeoffs / risks
- batch overflow strategy can't propose a flow spanning tools in different batches (documented); prefer select for relevance-aware workloads.

🤖 Generated with Claude Code
https://claude.ai/code/session_01DJKNoGApcS1joM2EzDe2Bk