diff --git a/docs/analysis/REPO_ANALYSIS_2026-05-07.md b/docs/analysis/REPO_ANALYSIS_2026-05-07.md new file mode 100644 index 0000000..53c644f --- /dev/null +++ b/docs/analysis/REPO_ANALYSIS_2026-05-07.md @@ -0,0 +1,612 @@ +# ExecutionKit — Deep Repo Analysis + +**Date:** 2026-05-07 +**Scope:** `TARGET_ENV=local-dev`, `INCLUDE_FRONTEND=false`, `INCLUDE_RESEARCH_DIR=false`, `MAX_FILES_DEEP_READ=50`, `COVERAGE_THRESHOLD=80`, `LANGUAGE_ECOSYSTEM=python` +**Methodology:** 5 parallel subagents (Topology, Architecture, Code Health, Test Quality, Security+DX) + orchestrator synthesis. + +--- + +## 1. Executive Summary + +1. **Zero mandatory runtime dependencies with full mypy --strict compliance** — the library ships 2,836 LOC of pure-stdlib Python across 20 modules, passes mypy strict in all of them, and produces only 1 ruff error (0.35/KLOC); the foundation is exceptionally solid. [`executionkit/__init__.py:49`, `pyproject.toml:21`] +2. **`provider.py` is simultaneously the architectural hub and the largest coverage gap** — imported by 9 of 20 modules, it is the central seam for HTTP, error mapping, and protocol definitions, yet sits at **58% coverage** because the live HTTP transport paths (`_post_httpx`, `_post_urllib`, lines 258–378) have no mock-HTTP tests. [`executionkit/provider.py`, §5.1] +3. **The four core pattern functions are monolithic state machines that dominate complexity** — `refine_loop` (182 lines, CC=12), `react_loop` (161 lines, CC=20), `consensus` (112 lines, CC=12), `structured` (88 lines) each mix budget management, retry orchestration, loop control, and response parsing; `react_loop` at CC=20 is the highest-risk function in the codebase. [`executionkit/patterns/react_loop.py:149`, §4.2–4.3] +4. **57 redundant `@pytest.mark.asyncio` decorators and a CI pytest command that will attempt integration tests without API keys** — `asyncio_mode = "auto"` in `pyproject.toml:65` makes all async-test marks redundant, and `.github/workflows/ci.yml:40` runs `pytest tests/` without `-m "not integration"`, meaning any future `@pytest.mark.integration` test without a graceful skip guard will fail every CI run. [§5.3, §5.4] +5. **`checked_complete` and `validate_score` are low-level internals contractually published in `__all__`**, creating premature API obligations; three public functions (`release_call`, `last_call`, `assistant_message`) have zero call sites in either source or tests. [`executionkit/__init__.py:77,89`, §3.4, §4.7] + +--- + +## 2. Repo Topology + +### 2.1 Package Inventory + +**Source modules** (`executionkit/`, flat layout): + +| Module | LOC | Key Public Exports | External Deps | +|--------|----:|-------------------|---------------| +| `__init__.py` | 163 | 38 symbols in `__all__` (all patterns, types, errors, Kit, sync wrappers) | 0 | +| `_mock.py` | 85 | `MockProvider` | 0 | +| `compose.py` | 156 | `pipe`, `pipe_sync`, `PatternStep` | 0 | +| `cost.py` | 81 | `CostTracker` | 0 | +| `errors.py` | 74 | 9 exception classes | 0 | +| `kit.py` | 98 | `Kit` | 0 | +| `provider.py` | **502** | `LLMProvider`, `ToolCallingProvider`, `Provider`, `LLMResponse`, `ToolCall`, 9 re-exported error classes | `httpx` (optional) | +| `types.py` | 122 | `TokenUsage`, `PatternResult`, `Tool`, `VotingStrategy`, `Evaluator` | 0 | +| `engine/__init__.py` | 21 | 9 engine symbols | 0 | +| `engine/convergence.py` | 82 | `ConvergenceDetector` | 0 | +| `engine/json_extraction.py` | 138 | `extract_json` | 0 | +| `engine/messages.py` | 15 | `user_message`, `assistant_message` | 0 | +| `engine/parallel.py` | 80 | `gather_strict`, `gather_resilient` | 0 | +| `engine/retry.py` | 115 | `RetryConfig`, `with_retry`, `DEFAULT_RETRY` | 0 | +| `patterns/__init__.py` | 15 | 4 pattern module refs | 0 | +| `patterns/base.py` | 232 | `checked_complete`, `validate_score` | 0 | +| `patterns/consensus.py` | 135 | `consensus` | 0 | +| `patterns/react_loop.py` | **358** | `react_loop` | 0 | +| `patterns/refine_loop.py` | 237 | `refine_loop` | 0 | +| `patterns/structured.py` | 127 | `structured` | 0 | +| **TOTAL** | **2,836** | | **1 optional** | + +**Tests** (`tests/`): + +| File | LOC | `def test_` count | +|------|----:|------------------:| +| `conftest.py` | 56 | 0 | +| `test_compose.py` | 250 | 13 | +| `test_concurrency.py` | 230 | 14 | +| `test_engine.py` | 661 | 86 | +| `test_exports.py` | 59 | 2 | +| `test_kit.py` | 283 | 15 | +| `test_patterns.py` | 1,715 | 109 | +| `test_provider.py` | 686 | 72 | +| `test_sync_and_parse.py` | 193 | 20 | +| `test_sync_wrappers.py` | 47 | 5 | +| `test_types.py` | 327 | 43 | +| **TOTAL** | **4,507** | **379** | + +**Test-to-source ratio:** 4,507 / 2,836 = **1.59×** (healthy for a library). + +### 2.2 Internal Dependency Graph + +```mermaid +graph TD + INIT[__init__.py] --> _mock + INIT --> compose + INIT --> cost + INIT --> kit + INIT --> engine_init[engine/__init__.py] + INIT --> patterns_base[patterns/base.py] + INIT --> patterns_consensus[patterns/consensus.py] + INIT --> patterns_react[patterns/react_loop.py] + INIT --> patterns_refine[patterns/refine_loop.py] + INIT --> patterns_structured[patterns/structured.py] + INIT --> provider + INIT --> types + + kit --> compose + kit --> cost + kit --> patterns_consensus + kit --> patterns_react + kit --> patterns_refine + kit --> provider + kit --> types + + compose --> provider + compose --> types + + provider --> errors + errors --> types + cost --> types + + _mock --> provider + + engine_init --> engine_conv[engine/convergence.py] + engine_init --> engine_json[engine/json_extraction.py] + engine_init --> engine_msg[engine/messages.py] + engine_init --> engine_par[engine/parallel.py] + engine_init --> engine_retry[engine/retry.py] + + engine_retry --> provider + + patterns_init[patterns/__init__.py] --> patterns_consensus + patterns_init --> patterns_react + patterns_init --> patterns_refine + patterns_init --> patterns_structured + + patterns_base --> cost + patterns_base --> engine_retry + patterns_base --> provider + patterns_base --> types + + patterns_consensus --> cost + patterns_consensus --> engine_msg + patterns_consensus --> engine_par + patterns_consensus --> engine_retry + patterns_consensus --> patterns_base + patterns_consensus --> provider + patterns_consensus --> types + + patterns_react --> cost + patterns_react --> engine_retry + patterns_react --> patterns_base + patterns_react --> provider + patterns_react --> types + + patterns_refine --> cost + patterns_refine --> engine_conv + patterns_refine --> engine_retry + patterns_refine --> patterns_base + patterns_refine --> provider + patterns_refine --> types + + patterns_structured --> cost + patterns_structured --> engine_json + patterns_structured --> engine_msg + patterns_structured --> engine_retry + patterns_structured --> errors + patterns_structured --> patterns_base + patterns_structured --> provider + patterns_structured --> types +``` + +**Key observations:** +- `types.py` has zero internal imports — pure leaf node. +- `errors.py` imports only `types.py` — second-lowest node. +- `provider.py` is imported by 9 of 20 modules — the architectural hub. +- All four pattern modules share identical fan-in: `patterns/base.py` → `provider`, `cost`, `engine/retry`, `types`. +- `engine/retry.py → provider` is the only upward coupling; it is documented and intentional. + +### 2.3 External Dependency Concentration + +**Zero mandatory runtime dependencies.** `pyproject.toml:21`: `dependencies = []`. + +| Dep | Files | Usage | +|-----|-------|-------| +| `httpx>=0.27` | `provider.py:43` | Optional extra (`[httpx]`); probed at import time via `try: import httpx`; used as `AsyncClient` HTTP backend; falls back to `urllib`. | + +All other imports are stdlib: `asyncio`, `collections`, `dataclasses`, `enum`, `inspect`, `itertools`, `json`, `logging`, `math`, `re`, `random`, `types`, `typing`, `urllib`, `warnings`. + +### 2.4 Language and Framework Mix + +| Metric | Value | Source | +|--------|-------|--------| +| Python required | `>=3.11` | `pyproject.toml:10` | +| CI Python matrix | 3.11, 3.12, 3.13 | `.github/workflows/ci.yml:16` | +| CI OS matrix | ubuntu-latest, windows-latest | `.github/workflows/ci.yml:15` | +| Language mix | 100% Python (source) | cloc-equivalent | +| Total Python LOC (all) | 7,788 | source + tests + examples | +| Build backend | hatchling | `pyproject.toml:3` | + +**Dev toolchain:** + +| Tool | Version | Role | +|------|---------|------| +| `ruff` | `>=0.14.0,<0.16` | Lint + format | +| `mypy` | `>=1.18` | Type check (`strict=true`) | +| `pytest` | `>=8.4` | Test runner | +| `pytest-asyncio` | `>=1.2` | Async test support (`asyncio_mode="auto"`) | +| `pytest-cov` | `>=7.0` | Coverage (`fail_under=80`) | +| `bandit[toml]` | `>=1.8` | SAST scanner | +| `httpx` | `>=0.27` | HTTP backend (dev: exercises httpx paths in tests) | +| `build` | `>=1.2` | Wheel/sdist builder | + +--- + +## 3. Architecture Assessment + +### 3.1 Entrypoints + +| Entrypoint | Type | Path | Invocation | +|-----------|------|------|------------| +| `consensus` | async pattern | `executionkit.patterns.consensus` | `await consensus(provider, prompt)` | +| `refine_loop` | async pattern | `executionkit.patterns.refine_loop` | `await refine_loop(provider, prompt)` | +| `react_loop` | async pattern | `executionkit.patterns.react_loop` | `await react_loop(provider, prompt, tools=[...])` | +| `structured` | async pattern | `executionkit.patterns.structured` | `await structured(provider, prompt, schema)` | +| `pipe` | async composition | `executionkit.compose` | `await pipe(provider, [step1, step2])` | +| `*_sync` (×5) | sync wrappers | `executionkit/__init__.py:118–163` | `consensus_sync(provider, prompt)` — raises `RuntimeError` inside event loop | +| `Kit` | session facade | `executionkit.kit` | `Kit(provider).refine(prompt)` | +| `checked_complete` | low-level (exposed) | `executionkit.patterns.base` | `await checked_complete(provider, messages, tracker, ...)` | +| Examples (×6) | standalone scripts | `examples/*.py` | `python examples/quickstart_openai.py` | + +No CLI scripts. `pyproject.toml` has no `[project.scripts]` section. + +### 3.2 Layering + +Stated architecture (`docs/architecture.md:67–79`) requires a strict downward dependency flow: +``` +engine/* → provider (retry only, documented exception) +patterns/* → cost, engine/*, provider, types +provider → types, errors +errors → types +``` + +**Verified:** No violations found. +- `engine/` → `patterns/`: zero matches (confirmed). +- `provider.py` → `patterns/`: zero matches (confirmed). +- `engine/retry.py:14` imports `ProviderError, RateLimitError` from `provider` — documented intentional coupling, not an oversight. + +Layering is **clean and matches the stated architecture**. + +### 3.3 Seam Integrity + +**LLM Provider seam:** `LLMProvider` and `ToolCallingProvider` are `@runtime_checkable` protocols at `provider.py:118,137`. Structural typing only — no inheritance required. `_TrackedProvider` correctly delegates `supports_tools` via `getattr` (`patterns/base.py:172`) rather than hardcoding. Seam is clean. + +**Error handling seam:** `_classify_http_error()` at `provider.py:452–485` is the single HTTP status → exception mapping point. Both `_post_httpx` (`provider.py:286`) and `_post_urllib` (`provider.py:322`) call it exclusively. No duplicated status-to-error mapping exists elsewhere. + +**Cost tracking seam:** `CostTracker` is mutated only through `checked_complete` (`patterns/base.py:139,148`) for two-phase TOCTOU-safe accounting. No pattern directly mutates `CostTracker._calls`. Seam is intact. + +**End-to-end trace (refine_loop):** +``` +refine_loop(provider, prompt) + → CostTracker() + ConvergenceDetector() + → checked_complete() [_check_budget → reserve_call → with_retry → provider.complete] + → tracker.record_without_call(response) + → Evaluator(text, provider) [re-enters checked_complete with same tracker] + → ConvergenceDetector.should_stop(score) + → PatternResult(cost=tracker.to_usage(), metadata=MappingProxyType(...)) +``` + +**One undocumented coupling:** The default evaluator in `refine_loop` captures `tracker` and `max_cost` from the outer scope (`refine_loop.py:153–160`). A custom `Evaluator` caller won't know their evaluator participates in budget accounting. This is correct behavior but should be documented in the `Evaluator` type alias docstring. + +### 3.4 Contract Stability + +- **Public API:** 38 names in `__all__` at `executionkit/__init__.py:49–90`. Version `"0.1.0"`. +- **Protocols:** `LLMProvider` (`provider.py:118`), `ToolCallingProvider` (`provider.py:137`). +- **Risk:** `checked_complete` (`__init__.py:77`) and `validate_score` (`__init__.py:89`) are low-level internals now part of the contracted surface. Any internal budget-guard or score-validation change breaks callers who imported these directly. +- **Recent changes:** `structured` and `structured_sync` added to `__all__` in commit `cc70ce1`. No removals or renames visible in the last four commits. + +### 3.5 Configuration Surface + +**Environment variables read by the library:** Zero. `executionkit/` contains no `os.environ` or `os.getenv` calls. `docs/api/configuration.md:108` documents this by design. + +**Config files read at runtime:** None. Zero `open()` calls in `executionkit/` source. + +**Required env vars with no documented default:** None within the library. Examples use `os.environ["OPENAI_API_KEY"]` without fallback — caller responsibility, not library responsibility. + +--- + +## 4. Code Health + +### 4.1 Oversized Files (>500 lines) + +| LOC | File | Status | +|----:|------|--------| +| **502** | `executionkit/provider.py` | ⚠️ 2 lines over cap | +| 358 | `executionkit/patterns/react_loop.py` | OK (but see §4.2) | +| 237 | `executionkit/patterns/refine_loop.py` | OK | +| 232 | `executionkit/patterns/base.py` | OK | + +Only `provider.py` breaches the 500-line cap — marginally. The higher risk is `react_loop.py` at 358 lines of dense async logic. + +### 4.2 Oversized Functions (>40 lines) + +| Lines | File:Range | Function | +|------:|-----------|---------| +| **182** | `patterns/refine_loop.py:56–237` | `refine_loop` | +| **161** | `patterns/react_loop.py:149–309` | `react_loop` | +| **112** | `patterns/consensus.py:24–135` | `consensus` | +| **88** | `patterns/structured.py:40–127` | `structured` | +| **68** | `engine/json_extraction.py:71–138` | `_extract_balanced` | +| **67** | `compose.py:90–156` | `pipe` | +| **61** | `patterns/base.py:89–149` | `checked_complete` | +| **48** | `engine/json_extraction.py:21–68` | `extract_json` | +| **47** | `patterns/react_loop.py:312–358` | `_execute_tool_call` | +| **44** | `engine/retry.py:72–115` | `with_retry` | + +9 of 100 functions (9%) exceed the 40-line cap. The four core pattern functions are the primary refactoring targets — each is a monolithic state machine mixing distinct responsibilities. + +### 4.3 Cyclomatic Complexity + +*Estimated via AST branch-node counting (radon not installed):* + +| CC | File:Line | Function | +|---:|----------|---------| +| **20** | `patterns/react_loop.py:149` | `react_loop` | +| **16** | `engine/json_extraction.py:71` | `_extract_balanced` | +| **13** | `patterns/react_loop.py:34` | `_validate_tool_args` | +| **12** | `provider.py:354` | `_extract_content` | +| **12** | `patterns/refine_loop.py:56` | `refine_loop` | +| **12** | `patterns/consensus.py:24` | `consensus` | +| **9** | `engine/retry.py:72` | `with_retry` | +| **9** | `engine/json_extraction.py:21` | `extract_json` | +| **9** | `compose.py:64` | `_filter_kwargs` | +| **8** | `provider.py:381` | `_parse_tool_calls` | + +Average CC across 100 functions: **3.28** (healthy). Functions with CC ≥ 10: **6** (industry threshold for "needs refactoring"). `react_loop` at CC=20 is the single highest-risk function. + +### 4.4 Type Safety + +``` +python -m mypy --strict executionkit/ +→ Success: no issues found in 20 source files +``` + +**PASS.** Zero strict errors. Exceptional for a library using generics (`PatternResult[T]`) and structural protocols. + +### 4.5 Linter Density + +``` +python -m ruff check executionkit/ --statistics +→ 1 TC003 typing-only-standard-library-import +→ Found 1 error. +``` + +**1 finding total. 0.35 errors/KLOC.** + +Location: `executionkit/kit.py:5` — `TracebackType` imported at module level; should be inside `if TYPE_CHECKING:`. One-line fix. + +### 4.6 Test Pyramid + +| Tier | Count | +|------|------:| +| Unit tests | **379** | +| Integration (`@pytest.mark.integration`) | **0** | +| E2E | **0** | + +100% unit tests is appropriate for a pure-library with no network dependency. The gap: **zero** integration tests validating real HTTP provider behavior. Any regression in the live transport paths (`_post_httpx`, `_post_urllib`) goes undetected without real API calls. + +### 4.7 Dead Code + +| Location | Symbol | Status | +|----------|--------|--------| +| `cost.py:39` | `CostTracker.release_call()` | No call sites in source or tests | +| `_mock.py:83` | `MockProvider.last_call()` | No call sites in any test | +| `engine/messages.py:13` | `assistant_message()` | No import or call sites anywhere | + +`assistant_message()` is the clearest dead export — the entire `engine/messages.py` module exists to export this function and `user_message()`, but only `user_message()` is used (by `consensus.py` and `structured.py`). `assistant_message()` has no callers. + +--- + +## 5. Test & Quality Coverage + +### 5.1 Coverage by Module + +**Overall: 85.11% — 80% threshold met. 387 tests, 0 failures.** + +| Module | Coverage | Status | +|--------|--------:|--------| +| `__init__.py` | 100% | ✅ | +| `errors.py` | 100% | ✅ | +| `types.py` | 100% | ✅ | +| `engine/convergence.py` | 100% | ✅ | +| `engine/parallel.py` | 100% | ✅ | +| `engine/retry.py` | 98% | ✅ | +| `patterns/consensus.py` | 98% | ✅ | +| `patterns/react_loop.py` | 96% | ✅ | +| `patterns/refine_loop.py` | 94% | ✅ | +| `kit.py` | 87% | ✅ | +| `_mock.py` | 88% | ✅ | +| `compose.py` | 86% | ✅ | +| `patterns/structured.py` | 89% | ✅ | +| `cost.py` | 84% | ✅ | +| `engine/json_extraction.py` | 83% | ✅ | +| `engine/messages.py` | 83% | ✅ | +| `patterns/base.py` | 100% | ✅ | +| **`provider.py`** | **58%** | 🔴 | + +`provider.py` is the sole module below threshold. Uncovered lines 258–378 are the live HTTP transport paths. Closing this requires mock-HTTP testing via `respx` (httpx) and `unittest.mock.patch` (urllib). + +### 5.2 Trivial / Low-Value Tests (~14 tests) + +| File:Lines | Category | Issue | +|-----------|----------|-------| +| `test_provider.py:37–62` | `issubclass`-only (×9) | Verifies Python MRO, not error behavior | +| `test_provider.py:261–263` | `isinstance`-only | Checks protocol conformance without calling any method | +| `test_provider.py:265–269` | `isinstance`-only | Same for MockProvider | +| `test_provider.py:332` | `isinstance`-only | ToolCallingProvider check, no call | +| `test_provider.py:619–620` | `isinstance(x, bool)` | Can never fail | +| `test_types.py:278–282` | Enum string values | `VotingStrategy.MAJORITY == "majority"` — can never fail (inherits `str`) | +| `test_types.py:284–286` | `isinstance(x, str)` | `VotingStrategy` inherits `str`; check is trivially true | + +### 5.3 Async Test Conformance + +`pyproject.toml:65`: `asyncio_mode = "auto"` — all `async def test_*` are auto-treated as asyncio tests. `@pytest.mark.asyncio` is **redundant** everywhere. + +**57 redundant marks** across 4 files: +- `test_compose.py`: 12 occurrences +- `test_kit.py`: 12 occurrences +- `test_patterns.py`: 26 occurrences +- `test_provider.py`: 7 occurrences + +**Additional:** `test_sync_wrappers.py:47` uses `asyncio.run(inner())` inside a sync test without `@pytest.mark.filterwarnings("ignore::RuntimeWarning")`, leaking a `RuntimeWarning: coroutine was never awaited` into coverage output. The other two similar tests (`test_sync_and_parse.py:91,114`) correctly suppress this warning. + +### 5.4 CI Gate + +Source: `.github/workflows/ci.yml` + +| Gate | Enforced | Detail | +|------|---------|--------| +| ruff check | ✅ | `ci.yml:31` | +| ruff format --check | ✅ | `ci.yml:34` | +| mypy --strict | ✅ | `ci.yml:37` | +| pytest --cov-fail-under=80 | ✅ | `ci.yml:40` | +| bandit -r executionkit/ | ✅ | `ci.yml:64` (separate job) | +| Python 3.11, 3.12, 3.13 | ✅ | `ci.yml:16–17` | +| Ubuntu + Windows | ✅ | `ci.yml:15` | +| `-m "not integration"` filter | ⚠️ **Missing** | `pytest tests/` without exclusion | +| Integration tests with API keys | ❌ | No secrets-injected integration job | +| Codecov fail-on-error | ⚠️ | `fail_ci_if_error: false` — silent on upload failure | + +**Risk:** `ci.yml:40` runs `pytest tests/` without `-m "not integration"`. If any future integration test doesn't skip gracefully when API keys are absent, it will fail CI. Add `-m "not integration"` to the base CI command. + +### 5.5 Evaluation Framework + +- **`Evaluator`:** Tested behaviorally through `refine_loop` tests in `test_patterns.py:266–460` — custom evaluators, convergence scenarios, and adversarial prompt-injection resistance (XML sandboxing via ``) are all covered. +- **`ConvergenceDetector`:** `test_engine.py:405–556` provides thorough coverage (100% on `engine/convergence.py`) covering boundary scores, patience, delta threshold, oscillation, and reset. +- **Executable benchmarks:** None. The `_analysis/` and `.full-review/` directories contain static review documents, not runnable quality benchmarks. No evaluation harness measures scoring accuracy against known ground truth. + +--- + +## 6. Security & Sanitization (local-dev scope) + +### 6.1 Secret Scan + +- **`.pre-commit-config.yaml`:** `detect-private-key` hook active (rev `v4.6.0`). Covers PEM-format private keys. Does **not** cover high-entropy API key strings (e.g., `sk-proj-...` format). +- **`bandit`:** 4 total findings — 0 HIGH, 1 MEDIUM (`B310` urllib `file:/` scheme at `provider.py:310`, suppressed), 3 LOW (B101 asserts, B311 random.uniform for jitter). +- **`.gitignore`:** `.env` and `.env.*` ignored at lines 16–18 with `!.env.example` carve-out at line 19. Correct. +- **Hardcoded credentials:** Zero matches in `executionkit/` for `sk-`, `Bearer `, or `api_key = "..."` patterns. + +**Gap:** `detect-secrets` (Yelp) is absent. High-entropy string scanning beyond PEM format is not automated. + +### 6.2 Input Surfaces + +| Surface | Entry point | Validation | +|---------|-------------|------------| +| API key | `Provider(api_key=...)` (`provider.py:164`) | Masked in `__repr__`; redacted from errors via `_redact_sensitive()` (`provider.py:438–449`); no format validation (empty string accepted) | +| User prompts | `prompt: str` in all patterns | Not sanitized; evaluator input wrapped in `` XML delimiters (`refine_loop.py:150`) | +| Tool JSON Schema | `tools: Sequence[Tool]` in `react_loop` | `_validate_tool_args` validates required fields and type correctness (`react_loop.py:34–66`); schema not validated against JSON Schema meta-schema | +| LLM responses | `_parse_tool_calls` → `_parse_tool_arguments` (`provider.py:381–423`) | Parsed and type-checked; trusted as structured data | +| File reads | None | No `open()` calls in `executionkit/`; zero file I/O | + +### 6.3 Privileged Operations + +**Zero.** Grep across `executionkit/` found no `subprocess`, `os.system`, `os.popen`, `shutil.rmtree`, writable `open()`, `exec(`, `eval(`, or shell operations. Only network I/O: outbound `urllib.request.urlopen` and `httpx.AsyncClient.post` to caller-configured endpoints. + +### 6.4 Dependency Audit + +`pip-audit` not installed in this environment. **Runtime exposure: zero** (stdlib only). `httpx>=0.27` (optional): no known HIGH+ CVEs at time of writing. Dev dependencies (bandit, mypy, ruff, pytest, etc.): no known HIGH CVEs. Gap: no automated CVE scanning in CI. + +### 6.5 Auth Surface + +`api_key` stored as frozen dataclass field for the Provider instance lifetime (`provider.py:164`). Not written to disk, not logged. Injected into `Authorization: Bearer` header at request time only (`provider.py:257–258`). `_redact_sensitive()` regex at `provider.py:444` strips `sk*`, `key*`, `token*`, `bearer*`, `auth*` prefixes — keys with non-standard prefixes would pass through error text unredacted. + +--- + +## 7. Developer Experience + +### 7.1 Cold-Start Path + +| Step | Command | Documented? | +|------|---------|-------------| +| 1 | `git clone ` | ✅ README | +| 2 | `python -m venv .venv && source .venv/bin/activate` | ⚠️ **Missing from README** | +| 3 | `pip install -e ".[dev]"` | ✅ README | +| 4 | `pre-commit install` | ❌ **Not documented anywhere** | +| 5 | `pytest -m "not integration" -q` | ✅ CLAUDE.md / Makefile | + +**4-step minimum.** Step 2 (venv creation) is absent from `README.md:86–90`, creating first-contributor friction on Windows especially. Step 4 (pre-commit install) is entirely undocumented. + +### 7.2 Setup Script Audit + +**Makefile targets:** `dev-setup`, `lint`, `format`, `format-check`, `type-check`, `test`, `coverage`, `build`, `clean`, `docs-serve`, `docs-build`, `docs-deploy`. + +Gaps: +- No `make install-precommit` target. +- `make docs-serve` requires `mkdocs` and `mkdocs-material` which are in the `docs` extras group, **not** `dev` — `make dev-setup` leaves docs tools uninstalled. +- `pre-commit` is not in any `pyproject.toml` extras group; contributors must discover and install it externally. + +### 7.3 Friction Inventory + +| Item | Detail | +|------|--------| +| `os.getenv` defaults | Zero calls in `executionkit/` — by design; no missing defaults | +| Platform divergence | Zero `sys.platform` or `os.name` checks in `executionkit/`; fully cross-platform | +| httpx upper bound | `httpx>=0.27` has no upper bound; a future httpx 1.x breaking change would not be caught | +| Other dev dep bounds | Most deps are lower-bound only; only `ruff` has an upper bound (`<0.16`) | +| TODO/FIXME/HACK | Zero in `executionkit/` source | + +### 7.4 Documentation Coverage + +| Metric | Value | +|--------|-------| +| Module-level docstrings | 20/20 = **100%** | +| Public function docstrings | 43/49 = **88%** | +| Example runnability | All 6 examples use `os.environ["OPENAI_API_KEY"]` + `asyncio.run()` — runnable with a live key | + +The 6 undocumented functions are primarily dunder methods and the `_run_sync` private helper. + +--- + +## 8. Findings & Recommendations + +### 🔴 Critical — Blocks CI or violates stated rules + +| # | Finding | Evidence | Impact | Recommendation | Effort | +|---|---------|---------|--------|---------------|--------| +| 1 | **`provider.py` at 58% coverage** — the central HTTP module is the library's most critical seam and its largest blind spot | `provider.py` lines 258–378 uncovered; §5.1 | Live HTTP regressions (error classification, content extraction, tool call parsing) go undetected | Add `respx`-based mock-HTTP tests for `_post_httpx` paths and `unittest.mock.patch("urllib.request.urlopen")` for `_post_urllib`; target ≥ 85% | M | +| 2 | **CI runs `pytest tests/` without `-m "not integration"`** | `.github/workflows/ci.yml:40` | Any future integration test without a graceful skip guard will fail every CI run without API keys | Append `-m "not integration"` to the CI pytest command immediately | S | + +### 🟡 High — Degrades maintainability, testability, or security posture + +| # | Finding | Evidence | Impact | Recommendation | Effort | +|---|---------|---------|--------|---------------|--------| +| 3 | **`react_loop` at CC=20, 161 lines** — highest complexity function in the codebase | `patterns/react_loop.py:149–309`; §4.2–4.3 | Hardest function to understand, modify, or test correctly; bug surface is proportional to CC | Extract sub-functions: `_run_think_act_loop()`, `_handle_tool_results()`, `_build_tool_message()` | M | +| 4 | **`refine_loop` at 182 lines** — monolithic state machine | `patterns/refine_loop.py:56–237`; §4.2 | Mixed concerns (loop, budget, convergence, evaluator) make it fragile to modify | Extract `_run_refinement_step()` and `_invoke_evaluator()` sub-functions | M | +| 5 | **`checked_complete` and `validate_score` exposed in `__all__`** — internal utilities with a public contract | `executionkit/__init__.py:77,89`; §3.4 | Any change to budget-guard or score-validation internals becomes a breaking API change | Move to `executionkit._internals` (private by convention) or add explicit `@unstable` / versioning notes in docstrings | S | +| 6 | **57 redundant `@pytest.mark.asyncio` decorators** — noise against `asyncio_mode="auto"` | `test_compose.py`, `test_kit.py`, `test_patterns.py`, `test_provider.py`; §5.3 | Test file maintenance burden; misleads readers about what configuration is needed | Remove all 57 redundant marks in one pass | S | +| 7 | **`assistant_message()` has no callers** — dead export | `engine/messages.py:13`; §4.7 | Dead code adds surface area and maintenance obligation | Remove from `engine/messages.py` and `engine/__init__.py.__all__`; verify not used externally before removal | S | +| 8 | **Default evaluator's `CostTracker` closure sharing is undocumented** | `patterns/refine_loop.py:153–160`; §3.3 | Custom `Evaluator` authors don't know their evaluator participates in budget accounting | Add note to `Evaluator` type alias docstring that evaluators called by `refine_loop` share the outer budget tracker | S | +| 9 | **Single ruff TC003 violation in `kit.py:5`** | `executionkit/kit.py:5`; §4.5 | CI lint check would catch this if ruff rules include TC003 (they do) — currently failing silently | Move `from types import TracebackType` inside `if TYPE_CHECKING:` | S | +| 10 | **README omits venv step and `pre-commit install`; `pre-commit` not in dev extras** | `README.md:86–90`; §7.1, §7.2 | New contributors (especially on Windows) install dev dependencies into global Python; hooks never installed | Add venv creation step to README; add `pre-commit` to `dev` extras; add `make install-precommit` target | S | + +### 🟢 Opportunistic — Quality improvements, DX wins, tech debt reduction + +| # | Finding | Evidence | Impact | Recommendation | Effort | +|---|---------|---------|--------|---------------|--------| +| 11 | **No `detect-secrets` hook** | `.pre-commit-config.yaml`; §6.1 | High-entropy API key strings could be committed undetected | Add `detect-secrets` pre-commit hook alongside `detect-private-key` | S | +| 12 | **No `pip-audit` in CI or dev extras** | `pyproject.toml:42–50`; §6.4 | Zero runtime deps today, but the gap is invisible if a dep is added later | Add `pip-audit` to `dev` extras; add `pip-audit` step to `ci.yml` | S | +| 13 | **`httpx` and most dev deps lack upper-bound pins** | `pyproject.toml:44`; §7.3 | Future breaking upstream releases could silently break the dev install | Add `httpx<1.0` (or follow httpx semver); add upper bounds for major-version-unstable deps | S | +| 14 | **`release_call()` and `last_call()` are dead code** | `cost.py:39`, `_mock.py:83`; §4.7 | Unused public methods imply false API richness | Remove or add tests that exercise them; add a `# Future API` comment if intentional | S | +| 15 | **14 low-value tests** | `test_provider.py:37–62`, `test_types.py:278–286`; §5.2 | The 9 issubclass-only tests and enum-value identity tests can never detect real behavioral regressions | Replace issubclass-only tests with "raise-and-catch" behavioral tests; remove the trivially-true `isinstance(VotingStrategy.MAJORITY, str)` checks | S | + +--- + +## 9. Appendix + +### A. Raw Stats + +| Metric | Value | +|--------|-------| +| Total source modules | 20 | +| Source LOC | 2,836 | +| Test LOC | 4,507 | +| Total test functions | 379 | +| Test-to-source ratio | 1.59× | +| Overall coverage | 85.11% | +| mypy --strict errors | 0 | +| ruff errors | 1 (TC003, kit.py:5) | +| ruff errors/KLOC | 0.35 | +| bandit HIGH findings | 0 | +| bandit MEDIUM findings | 1 (B310, suppressed) | +| Functions > 40 lines | 9 / 100 (9%) | +| Functions with CC ≥ 10 | 6 / 100 (6%) | +| Average CC | 3.28 | +| Redundant `@pytest.mark.asyncio` | 57 | +| Mandatory runtime deps | 0 | +| Optional runtime deps | 1 (httpx) | +| CI OS matrix | ubuntu-latest, windows-latest | +| CI Python matrix | 3.11, 3.12, 3.13 | + +### B. Files Inspected + +**Full reads:** +`executionkit/__init__.py`, `executionkit/_mock.py`, `executionkit/compose.py`, `executionkit/cost.py`, `executionkit/errors.py`, `executionkit/kit.py`, `executionkit/provider.py`, `executionkit/types.py`, `executionkit/engine/__init__.py`, `executionkit/engine/convergence.py`, `executionkit/engine/json_extraction.py`, `executionkit/engine/messages.py`, `executionkit/engine/parallel.py`, `executionkit/engine/retry.py`, `executionkit/patterns/__init__.py`, `executionkit/patterns/base.py`, `executionkit/patterns/consensus.py`, `executionkit/patterns/react_loop.py`, `executionkit/patterns/refine_loop.py`, `executionkit/patterns/structured.py`, `tests/conftest.py`, `tests/test_compose.py`, `tests/test_concurrency.py`, `tests/test_engine.py`, `tests/test_exports.py`, `tests/test_kit.py`, `tests/test_patterns.py`, `tests/test_provider.py`, `tests/test_sync_and_parse.py`, `tests/test_sync_wrappers.py`, `tests/test_types.py`, `pyproject.toml`, `.github/workflows/ci.yml`, `.pre-commit-config.yaml`, `Makefile`, `docs/architecture.md`, `CLAUDE.md` + +**Head/grep only:** +`README.md`, `SECURITY.md`, `.gitignore`, `examples/*.py`, `.github/workflows/docs.yml`, `.github/workflows/publish.yml`, `.github/dependabot.yml` + +**Not read:** +`dev/planning/*`, `.full-review/*`, `docs/c4/*` (out of scope for this analysis) + +### C. Commands Executed + +```bash +git log --oneline -30 +git log --oneline -20 -- executionkit/__init__.py +find . -not -path './.git/*' ... | sort # structure listing +python -c "..." # LOC counting per extension +python -m pytest --cov=executionkit --cov-report=term-missing -m "not integration" -q +python -m mypy --strict executionkit/ +python -m ruff check executionkit/ --statistics +python -m mypy --version +# radon not installed; CC estimated via AST branch-node counting +# vulture not installed; dead code found via AST analysis +# pip-audit not installed; manual dep assessment performed +``` + +### D. Methodology Notes and Limitations + +1. **Cyclomatic complexity** was estimated via Python AST branch-node counting, not `radon cc`. Results are directionally accurate but may differ from radon's McCabe implementation by ±1–2 per function. +2. **`pip-audit` and `vulture`** were not installed. Dependency CVE exposure and dead code are assessed manually. +3. **Integration test behavior in CI** (finding #2) is a structural risk identified from static analysis; actual failure would require a future integration test to be added without skip guards. +4. **`provider.py` coverage gap** was identified from `pytest --cov` term-missing output; lines 258–378 are the live HTTP transport paths, not logic paths reachable by `MockProvider`. +5. **UNVERIFIED:** `examples/quickstart_ollama.py` content was identified by name only, not read. All other findings are directly evidenced from file reads and command output.