docs: Eval & Grader Registry design doc (#13) #337
Adds docs/design/13-eval-registry.md covering the design for a shared eval and grader registry. Design-only; no implementation.
Note: issue #13 asked for docs/research/, but that path is gitignored. Placed in docs/design/ to match existing convention (135-improve-concurrency.md, 194-baseline-skill-impact.md) and to answer the open question from the issue validation comment.
Decisions cover sub-issues:
- #15 Go-module-style refs: ref syntax, SemVer + lockfile, content-addressed cache, gh/env auth, flat transitive deps.
- #17 Composable eval construction: registry search/add/get/sync, deep-merge override rules, waza init --grader scaffolding.
- #18 Plugin extensibility: WGP/1 protocol over WASM (sandboxed) and program (bring-your-own-binary); Go plugins and embedded scripting rejected with rationale.
Includes security model, backward-compat impact, a 5-phase rollout (spec, local resolver, git backend, WASM runtime, hardening), open questions, rejected alternatives, and example end-state YAML + lockfile. Backend selection (#16) deferred to start of Phase 2.
Refs #13 #15 #17 #18
Pull request overview
Adds a design document proposing a shared eval & grader registry for waza, covering reference syntax/lockfiles, registry discovery & composition UX, and an extensibility model (WASM + external program protocol) intended to close the “shared registry” competitive gap vs. OpenAI Evals.
Changes:
- Introduces a full design doc for registry refs (`host/path@version#subpath`), caching, lockfiles, auth, and transitive deps (#15).
- Specifies CLI UX for discovery/composition (`waza registry search/add/get/sync/list`) and deep-merge override rules (#17).
- Proposes a plugin model and security posture (WGP/1 + WASM sandbox) with a phased rollout plan (#18).
| File | Description |
|---|---|
| docs/design/13-eval-registry.md | New design doc describing the eval/grader registry architecture, UX, security model, and rollout phases. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 4
| ref = host "/" path [ "@" version ] [ "#" subpath ] | ||
| host = DNS-1123 hostname (e.g. github.com, gitlab.example.com) | ||
| path = 1*( segment "/" ) segment | ||
| version = semver | "latest" | "main" | git-sha-7..40 |
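For illustration, the ref grammar quoted above could be parsed roughly as follows. This is a minimal sketch, not the design doc's implementation: `Ref` and `ParseRef` are hypothetical names, the example ref's repo and version are made up, and hostname (DNS-1123) and SemVer validation are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Ref is a hypothetical parsed form of host/path[@version][#subpath].
type Ref struct {
	Host    string
	Path    string
	Version string // empty when the ref has no "@version"
	Subpath string // empty when the ref has no "#subpath"
}

// ParseRef splits a ref string following the quoted grammar.
// It checks structure only; it does not validate the host or the version.
func ParseRef(s string) (Ref, error) {
	var r Ref
	// Split off the subpath first so '@' or '/' inside it are not misread.
	if i := strings.Index(s, "#"); i >= 0 {
		r.Subpath, s = s[i+1:], s[:i]
		if r.Subpath == "" {
			return Ref{}, errors.New("empty subpath after '#'")
		}
	}
	if i := strings.LastIndex(s, "@"); i >= 0 {
		r.Version, s = s[i+1:], s[:i]
		if r.Version == "" {
			return Ref{}, errors.New("empty version after '@'")
		}
	}
	host, path, ok := strings.Cut(s, "/")
	if !ok || host == "" {
		return Ref{}, errors.New("ref must start with host/")
	}
	// path = 1*( segment "/" ) segment, i.e. at least two non-empty segments.
	segs := strings.Split(path, "/")
	if len(segs) < 2 {
		return Ref{}, errors.New("path needs at least two segments")
	}
	for _, seg := range segs {
		if seg == "" {
			return Ref{}, errors.New("empty path segment")
		}
	}
	r.Host, r.Path = host, path
	return r, nil
}

func main() {
	// Hypothetical ref used only for demonstration.
	r, err := ParseRef("github.com/waza-evals/graders@v1.2.0#rubric/strict")
	fmt.Printf("%+v err=%v\n", r, err)
}
```

Splitting on `#` before `@` keeps a subpath that happens to contain `@` from being read as a version.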
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| **Status:** Draft for review | ||
| **Date:** 2026-06-19 | ||
| > **Note on location:** Issue #13 originally specified `docs/research/waza-eval-registry-design.md`, but that directory is gitignored (`.gitignore` line 116, "Internal research docs"). This doc lives under `docs/design/` to match the established convention (`docs/design/135-improve-concurrency.md`, `docs/design/194-baseline-skill-impact.md`) and to ensure it ships in the repo. Validation comment on #13 explicitly asked which location is canonical; this PR proposes `docs/design/` as the answer. |
| ## 10. Compatibility considerations | ||
| - **Backward compatible.** Existing `eval.yaml` files keep working: the `ref:` field is new and additive. `GraderConfig.UnmarshalYAML` already uses `KnownFields(true)`, so we'll need to add `ref` to the raw struct; no other change is required for existing graders. | ||
| - **JSON schema.** `schemas/eval.schema.json` (referenced by `site/src/content/docs/reference/schema.mdx`) needs a new union: either the existing typed grader or `{ ref: string, name?, weight?, model?, config? }`. We can express this as `oneOf` keyed on the presence of `ref`. |
spboyer left a comment
Adds a registry design doc; the structure is solid, but several spec details would mislead or weaken implementation.
Issues to address:
- docs/design/13-eval-registry.md:119 - subpath grammar lacks explicit traversal and symlink-escape rejection
- docs/design/13-eval-registry.md:441 - compromised-index mitigation overstates first-use integrity guarantees
- docs/design/13-eval-registry.md:158 - cache path is Linux/XDG-only instead of OS cache-dir based
- docs/design/13-eval-registry.md:176 - GitLab credential command does not return a token
- docs/design/13-eval-registry.md:348 - wasmtime-go static dependency claim misses CGO/cross-compile tradeoffs
- docs/design/13-eval-registry.md:477 - GitHub refs rely on remote git archive, which GitHub does not support
- docs/design/13-eval-registry.md:309 - summary says one runtime but design uses WASM plus program runtimes
| host = DNS-1123 hostname (e.g. github.com, gitlab.example.com) | ||
| path = 1*( segment "/" ) segment | ||
| version = semver | "latest" | "main" | git-sha-7..40 | ||
| subpath = POSIX path relative to repo root |
[HIGH] Security: #subpath is only defined as a POSIX path relative to the repo root, but the design never requires normalization or rejection of .., absolute paths, or symlink escapes. Resolver implementations could extract or read files outside the artifact/cache root when resolving subdirectory refs. Specify that subpaths are cleaned and must remain under the resolved artifact root after symlink evaluation.
| | Private repo token leakage | Tokens read from env or `gh auth`; never written to lockfile or results.json; redacted in `--verbose` logs. | | ||
| | Cache poisoning | Content-addressed cache (sha256); on read, digest is re-verified before use. | | ||
| | Typosquatting in federated index | Hosts are explicit (`github.com/waza-evals/...`); no short names; `waza registry add` shows the full ref and (post-v1) signature info before writing the lockfile. | | ||
| | Compromised index URL | Each entry's `digest`/`source` is verified at fetch time, so a hostile index can only deny service or surface known-good artifacts — it cannot substitute content. | |
[HIGH] Security: The compromised-index row overstates what digest verification can guarantee on first use. If a hostile index supplies both source and digest, fetching bytes and verifying them against that digest only proves the artifact matches the hostile metadata, not that it is trusted. Clarify that lockfiles protect subsequent resolutions and that first-use substitution requires trusted metadata/signatures or explicit user trust.
| ### 5.3 Cache layout | ||
| ``` | ||
| $XDG_CACHE_HOME/waza/registry/ |
[MEDIUM] Cross-platform portability: The cache layout is specified as $XDG_CACHE_HOME/..., which is Linux-specific. Waza is a cross-platform Go CLI, so literal XDG semantics will either diverge on macOS/Windows or place cache data in non-idiomatic locations. Define the root as Go's os.UserCacheDir() with per-OS examples instead.
| | Host pattern | Default credential source | Override | | ||
| |--------------|---------------------------|----------| | ||
| | `github.com/*` | `gh auth token` (if installed); else `GITHUB_TOKEN` env | `WAZA_REGISTRY_TOKEN_GITHUB_COM` | | ||
| | `gitlab.*` | `glab auth status` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` | |
[MEDIUM] Documentation: glab auth status verifies login state but does not return a token, so the credential source described here is not implementable as written. This would break GitLab private registry auth if copied directly into Phase 2. Use glab auth token or glab auth status -t for parity with the GitHub row.
| | `gitlab.*` | `glab auth status` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` | | |
| | `gitlab.*` | `glab auth token` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` | |
| ### 7.3 WASM runtime details | ||
| - Runtime: `wasmtime-go` (mature, sandboxed, cross-platform, single static dependency). |
[MEDIUM] Documentation: wasmtime-go is described as a single static dependency, but it is a CGO binding to the Wasmtime C API. That tradeoff affects static builds and cross-compilation for a Go CLI, so leaving it out makes the Phase 3 runtime choice look lower-risk than it is. Either acknowledge the CGO/release-artifact impact or compare wazero as the pure-Go alternative.
| ### Phase 2 — Git backend + auth + cache (4–6 weeks) | ||
| - Add `git` backend resolving `github.com/owner/repo@version` via `git archive` (no full clone) for tags, fallback to `git ls-remote` + sparse fetch for branches/SHAs. |
[MEDIUM] Documentation: The Git backend plan says GitHub refs resolve via git archive, but GitHub does not support the remote git archive protocol. Implementing this literally will fail for the primary github.com/owner/repo@version case. Use GitHub archive/tarball downloads for GitHub, or make shallow/sparse fetch the primary path.
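As a rough illustration of the tarball alternative, the GitHub REST API exposes a repository archive endpoint at `/repos/{owner}/{repo}/tarball/{ref}`. The sketch only builds that URL; `GitHubTarballURL` is a hypothetical helper, the owner, repo, and tag are made up, and actually fetching (following the redirect, sending an auth header for private repos) is left out. Refs containing slashes, such as some branch names, would need separate handling that this sketch does not attempt.

```go
package main

import (
	"fmt"
	"net/url"
)

// GitHubTarballURL builds the REST API archive URL for a tag or commit SHA.
// It performs no network access.
func GitHubTarballURL(owner, repo, version string) string {
	return fmt.Sprintf(
		"https://api.github.com/repos/%s/%s/tarball/%s",
		url.PathEscape(owner), url.PathEscape(repo), url.PathEscape(version),
	)
}

func main() {
	// Hypothetical repository and tag, for demonstration only.
	fmt.Println(GitHubTarballURL("waza-evals", "graders", "v1.2.0"))
}
```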
| | Embedded scripting (Lua/Starlark/JS) | ✅ | ⚠️ Sandbox quality varies | Slower than native | Easy (text artifact) | ❌ Rejected (yet-another-language; WASM gives us all of these via guest toolchains) | | ||
| | In-proc Python | ✅ Where python exists | ❌ | Slow startup | Hard (env hell) | ❌ Rejected (replicates OpenAI Evals' main pain point) | | ||
| Two formats, one protocol, one runtime, no language lock-in. |
[LOW] Documentation: "one runtime" contradicts the design in §7.2/§7.3, which has a WASM runtime plus an external program runner sharing WGP/1. That makes the summary misleading when readers compare it to the detailed architecture. Change it to "two formats, one protocol, two runtimes, no language lock-in."
Closes #13. Refs #15, #17, #18.
Design-only doc for a shared eval & grader registry — waza's #1 competitive gap vs. OpenAI Evals. No code changes.
What's inside
`docs/design/13-eval-registry.md` covers:
- `host/path@version#subpath` syntax, SemVer + `eval.lock.yaml` for reproducibility, content-addressed cache, `gh auth token` / env-var auth, flat transitive-dep resolution.
- `waza registry search/add/get/sync/list` CLI, federated index file, deep-merge override rules, `waza init --grader` scaffolding.
- `ref:` field on `GraderConfig`, schema update, results.json `source` field), and a 5-phase rollout where each phase is independently shippable:
- `program` runtime (validates the contract without picking a backend)
- `eval.yaml` and `eval.lock.yaml`.
Path note
Issue body specified `docs/research/waza-eval-registry-design.md`, but `docs/research/` is gitignored (`.gitignore` line 116, "Internal research docs"). The validation comment on #13 explicitly asked which location is canonical (`docs/design/` vs `docs/plans/`). I placed the doc at `docs/design/13-eval-registry.md` to match existing convention (`135-improve-concurrency.md`, `194-baseline-skill-impact.md`). Happy to move if the team prefers `docs/plans/`.
Out of scope (intentional)
- `Resolver` interface; concrete backend chosen at start of Phase 2 based on artifact-size benchmarks.
- `github.com/waza-evals/openai-compat/*` namespace) without committing to it here.
Review asks
- `--frozen` for CI approach matches how teams want to consume registry graders.
- `gh auth token` for GitHub by default, env-var overrides per host, never store secrets in `~/.waza/credentials.yaml`.
- `docs/design/` is my recommendation).