docs: Eval & Grader Registry design doc (#13) by spboyer · Pull Request #337 · microsoft/waza

spboyer · 2026-06-19T13:11:27Z

Closes #13. Refs #15, #17, #18.

Design-only doc for a shared eval & grader registry — waza's #1 competitive gap vs. OpenAI Evals. No code changes.

What's inside

docs/design/13-eval-registry.md covers:

feat: Go-module-style grader/eval references #15 — Go-module-style refs: host/path@version#subpath syntax, SemVer + eval.lock.yaml for reproducibility, content-addressed cache, gh auth token / env-var auth, flat transitive-dep resolution.
feat: Composable eval construction from registry graders #17 — Composable construction: waza registry search/add/get/sync/list CLI, federated index file, deep-merge override rules, waza init --grader scaffolding.
feat: Grader plugin extensibility (WASM/external programs) #18 — Plugin extensibility: WGP/1 (Waza Grader Protocol) over two runtimes — WASM (sandboxed via wasmtime, preferred for registry-distributed graders) and program (formalized bring-your-own-binary). Go plugins and embedded scripting rejected with rationale.
Security model (digest pinning, sandbox limits, no-secret lockfile), backward-compat impact (additive ref: field on GraderConfig, schema update, results.json source field), and a 5-phase rollout (Phase 0 through Phase 4) where each phase is independently shippable:
Phase 0: Spec & schema
Phase 1: Local-only resolver + program runtime (validates the contract without picking a backend)
Phase 2: Git backend + auth + cache
Phase 3: WASM runtime + sandbox
Phase 4: Hardening (sigstore, vet, OCI backend)
Decision matrix (D1–D10), open questions, rejected alternatives, example end-state eval.yaml and eval.lock.yaml.

Path note

Issue body specified docs/research/waza-eval-registry-design.md, but docs/research/ is gitignored (.gitignore line 116, "Internal research docs"). The validation comment on #13 explicitly asked which location is canonical (docs/design/ vs docs/plans/). I placed the doc at docs/design/13-eval-registry.md to match existing convention (135-improve-concurrency.md, 194-baseline-skill-impact.md). Happy to move if the team prefers docs/plans/.

Out of scope (intentional)

feat: Registry backend evaluation (Git/OCI/Releases/federated) #16 — Backend selection (Git vs OCI vs Releases vs federated). Design is backend-agnostic via a Resolver interface; concrete backend chosen at start of Phase 2 based on artifact-size benchmarks.
feat: Map OpenAI Evals YAML format → waza graders #14 — OpenAI Evals format mapping. Design ensures it's expressible (via a future github.com/waza-evals/openai-compat/* namespace) without committing to it here.
Implementation. Phase 0 is the only thing that touches code, and it's deferred to a follow-up issue.

Review asks

Sanity-check the WASM-vs-program split (D9, §7.1). The rationale for rejecting Go plugins and embedded scripting is in §13.
Confirm the lockfile + --frozen for CI approach matches how teams want to consume registry graders.
Confirm the auth model (D4): gh auth token for GitHub by default, env-var overrides per host, never store secrets in ~/.waza/credentials.yaml.
Decide canonical doc location (docs/design/ is my recommendation).

Adds docs/design/13-eval-registry.md covering the design for a shared eval and grader registry. Design-only; no implementation.

Note: issue #13 asked for docs/research/, but that path is gitignored. Placed in docs/design/ to match existing convention (135-improve-concurrency.md, 194-baseline-skill-impact.md) and to answer the open question from the issue validation comment.

Decisions cover sub-issues:
- #15 Go-module-style refs: ref syntax, SemVer + lockfile, content-addressed cache, gh/env auth, flat transitive deps.
- #17 Composable eval construction: registry search/add/get/sync, deep-merge override rules, waza init --grader scaffolding.
- #18 Plugin extensibility: WGP/1 protocol over WASM (sandboxed) and program (bring-your-own-binary); Go plugins and embedded scripting rejected with rationale.

Includes security model, backward-compat impact, a 5-phase rollout (spec, local resolver, git backend, WASM runtime, hardening), open questions, rejected alternatives, and example end-state YAML + lockfile. Backend selection (#16) deferred to start of Phase 2.

Refs #13 #15 #17 #18

Copilot

Pull request overview

Adds a design document proposing a shared eval & grader registry for waza, covering reference syntax/lockfiles, registry discovery & composition UX, and an extensibility model (WASM + external program protocol) intended to close the “shared registry” competitive gap vs. OpenAI Evals.

Changes:

Introduces a full design doc for registry refs (host/path@version#subpath), caching, lockfiles, auth, and transitive deps (#15).
Specifies CLI UX for discovery/composition (waza registry search/add/get/sync/list) and deep-merge override rules (#17).
Proposes a plugin model and security posture (WGP/1 + WASM sandbox) with a phased rollout plan (#18).

Summary per file:

| File | Description |
|------|-------------|
| docs/design/13-eval-registry.md | New design doc describing the eval/grader registry architecture, UX, security model, and rollout phases. |

Copilot's findings

Files reviewed: 1/1 changed files
Comments generated: 4

+ref     = host "/" path [ "@" version ] [ "#" subpath ]
+host    = DNS-1123 hostname (e.g. github.com, gitlab.example.com)
+path    = 1*( segment "/" ) segment
+version = semver | "latest" | "main" | git-sha-7..40
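
To make the quoted grammar concrete, here is a minimal Go sketch of splitting a ref into its four parts. The `Ref` type, the `ParseRef` function, and the `github.com/waza-evals/core` example ref are hypothetical names chosen for illustration, not taken from the design doc. The sketch checks structure only: it does not validate the version token and does not sanitize the subpath.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Ref is a hypothetical parsed form of host/path@version#subpath.
type Ref struct {
	Host, Path, Version, Subpath string
}

// ParseRef splits a reference following the quoted grammar:
// ref = host "/" path [ "@" version ] [ "#" subpath ].
// Version and Subpath are returned as-is, without validation.
func ParseRef(s string) (Ref, error) {
	var r Ref
	// The "#subpath" part comes last, so strip it first.
	if i := strings.Index(s, "#"); i >= 0 {
		s, r.Subpath = s[:i], s[i+1:]
	}
	if i := strings.Index(s, "@"); i >= 0 {
		s, r.Version = s[:i], s[i+1:]
	}
	host, path, ok := strings.Cut(s, "/")
	if !ok || host == "" {
		return Ref{}, errors.New("ref must start with host/")
	}
	// path = 1*( segment "/" ) segment, so at least two segments.
	segs := strings.Split(path, "/")
	if len(segs) < 2 {
		return Ref{}, errors.New("path needs at least two segments")
	}
	for _, seg := range segs {
		if seg == "" {
			return Ref{}, errors.New("empty path segment")
		}
	}
	r.Host, r.Path = host, path
	return r, nil
}

func main() {
	r, err := ParseRef("github.com/waza-evals/core@v1.2.0#graders/exact")
	fmt.Println(r, err)
}
```

A real resolver would presumably also reject an empty version after `@` and check the host against DNS-1123; both are left out here to keep the sketch short.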


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 1/1 changed files
Comments generated: 2

+**Status:** Draft for review
+**Date:** 2026-06-19
+
+> **Note on location:** Issue #13 originally specified `docs/research/waza-eval-registry-design.md`, but that directory is gitignored (`.gitignore` line 116, "Internal research docs"). This doc lives under `docs/design/` to match the established convention (`docs/design/135-improve-concurrency.md`, `docs/design/194-baseline-skill-impact.md`) and to ensure it ships in the repo. Validation comment on #13 explicitly asked which location is canonical; this PR proposes `docs/design/` as the answer.


+## 10. Compatibility considerations
+
+- **Backward compatible.** Existing `eval.yaml` files keep working: the `ref:` field is new and additive. `GraderConfig.UnmarshalYAML` already uses `KnownFields(true)`, so we'll need to add `ref` to the raw struct; no other change is required for existing graders.
+- **JSON schema.** `schemas/eval.schema.json` (referenced by `site/src/content/docs/reference/schema.mdx`) needs a new union: either the existing typed grader or `{ ref: string, name?, weight?, model?, config? }`. We can express this as `oneOf` keyed on the presence of `ref`.
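
The interaction between strict field checking and the new `ref` key can be sketched as follows. This is an illustration under stated assumptions, not the waza implementation: it uses `encoding/json` with `DisallowUnknownFields` as a standard-library stand-in for YAML `KnownFields(true)`, and the `graderRaw` struct, the `type` key for existing typed graders, and the `decodeGrader` function are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// graderRaw is a hypothetical raw struct. With strict decoding, any key
// not listed here is rejected, which is why Ref must be added explicitly.
type graderRaw struct {
	Type   string         `json:"type,omitempty"` // assumed key for existing typed graders
	Ref    string         `json:"ref,omitempty"`  // new, additive field
	Name   string         `json:"name,omitempty"`
	Weight float64        `json:"weight,omitempty"`
	Model  string         `json:"model,omitempty"`
	Config map[string]any `json:"config,omitempty"`
}

// decodeGrader decodes strictly and enforces the "oneOf" rule:
// a grader is either typed or registry-referenced, never both or neither.
func decodeGrader(data []byte) (graderRaw, error) {
	var g graderRaw
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&g); err != nil {
		return graderRaw{}, err
	}
	if (g.Type == "") == (g.Ref == "") {
		return graderRaw{}, errors.New("grader needs exactly one of type or ref")
	}
	return g, nil
}

func main() {
	g, err := decodeGrader([]byte(`{"ref":"github.com/waza-evals/core@v1.2.0","weight":2}`))
	fmt.Println(g.Ref, g.Weight, err)
}
```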


spboyer

Adds a registry design doc; the structure is solid, but several spec details would mislead or weaken implementation.

Issues to address:

docs/design/13-eval-registry.md:119 - subpath grammar lacks explicit traversal and symlink-escape rejection
docs/design/13-eval-registry.md:441 - compromised-index mitigation overstates first-use integrity guarantees
docs/design/13-eval-registry.md:158 - cache path is Linux/XDG-only instead of OS cache-dir based
docs/design/13-eval-registry.md:176 - GitLab credential command does not return a token
docs/design/13-eval-registry.md:348 - wasmtime-go static dependency claim misses CGO/cross-compile tradeoffs
docs/design/13-eval-registry.md:477 - GitHub refs rely on remote git archive, which GitHub does not support
docs/design/13-eval-registry.md:309 - summary says one runtime but design uses WASM plus program runtimes

spboyer · 2026-06-19T18:11:57Z

+host    = DNS-1123 hostname (e.g. github.com, gitlab.example.com)
+path    = 1*( segment "/" ) segment
+version = semver | "latest" | "main" | git-sha-7..40
+subpath = POSIX path relative to repo root


[HIGH] Security: #subpath is only defined as a POSIX path relative to the repo root, but the design never requires normalization or rejection of .., absolute paths, or symlink escapes. Resolver implementations could extract or read files outside the artifact/cache root when resolving subdirectory refs. Specify that subpaths are cleaned and must remain under the resolved artifact root after symlink evaluation.

spboyer · 2026-06-19T18:11:57Z

+| Private repo token leakage | Tokens read from env or `gh auth`; never written to lockfile or results.json; redacted in `--verbose` logs. |
+| Cache poisoning | Content-addressed cache (sha256); on read, digest is re-verified before use. |
+| Typosquatting in federated index | Hosts are explicit (`github.com/waza-evals/...`); no short names; `waza registry add` shows the full ref and (post-v1) signature info before writing the lockfile. |
+| Compromised index URL | Each entry's `digest`/`source` is verified at fetch time, so a hostile index can only deny service or surface known-good artifacts — it cannot substitute content. |


[HIGH] Security: The compromised-index row overstates what digest verification can guarantee on first use. If a hostile index supplies both source and digest, fetching bytes and verifying them against that digest only proves the artifact matches the hostile metadata, not that it is trusted. Clarify that lockfiles protect subsequent resolutions and that first-use substitution requires trusted metadata/signatures or explicit user trust.
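
The distinction drawn here can be sketched in Go: a digest check only means something when the expected digest comes from a source the user already trusts, such as an existing lockfile entry. The names `VerifyArtifact`, `Digest`, and `ErrUntrustedFirstUse`, the `sha256:<hex>` string format, and the map-based lockfile are assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrUntrustedFirstUse signals that no pinned digest exists yet, so a
// matching index-supplied digest would prove nothing about trust.
var ErrUntrustedFirstUse = errors.New("no locked digest: first use needs explicit trust")

// Digest returns the sha256 of data in an assumed "sha256:<hex>" format.
func Digest(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// VerifyArtifact compares fetched bytes against the digest pinned in the
// lockfile (modeled here as a ref -> digest map), never against the index.
func VerifyArtifact(locked map[string]string, ref string, data []byte) error {
	want, ok := locked[ref]
	if !ok {
		return ErrUntrustedFirstUse
	}
	if got := Digest(data); got != want {
		return fmt.Errorf("digest mismatch for %s: got %s, want %s", ref, got, want)
	}
	return nil
}

func main() {
	artifact := []byte("grader bytes")
	locked := map[string]string{"github.com/waza-evals/core@v1.2.0": Digest(artifact)}
	fmt.Println(VerifyArtifact(locked, "github.com/waza-evals/core@v1.2.0", artifact))
	fmt.Println(VerifyArtifact(locked, "github.com/waza-evals/other@v0.1.0", artifact))
}
```

How the first-use case is resolved (signatures, a trusted metadata source, or an explicit user prompt) is the open design decision; the sketch only shows that it has to be handled separately from the locked path.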

spboyer · 2026-06-19T18:11:57Z

+### 5.3 Cache layout
+
+```
+$XDG_CACHE_HOME/waza/registry/


[MEDIUM] Cross-platform portability: The cache layout is specified as $XDG_CACHE_HOME/..., which is Linux-specific. Waza is a cross-platform Go CLI, so literal XDG semantics will either diverge on macOS/Windows or place cache data in non-idiomatic locations. Define the root as Go's os.UserCacheDir() with per-OS examples instead.

spboyer · 2026-06-19T18:11:57Z

+| Host pattern | Default credential source | Override |
+|--------------|---------------------------|----------|
+| `github.com/*` | `gh auth token` (if installed); else `GITHUB_TOKEN` env | `WAZA_REGISTRY_TOKEN_GITHUB_COM` |
+| `gitlab.*` | `glab auth status` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` |


[MEDIUM] Documentation: glab auth status verifies login state but does not return a token, so the credential source described here is not implementable as written. This would break GitLab private registry auth if copied directly into Phase 2. Use glab auth token or glab auth status -t for parity with the GitHub row.

Suggested change

| `gitlab.*` | `glab auth status` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` |

| `gitlab.*` | `glab auth token` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` |

spboyer · 2026-06-19T18:11:57Z

+
+### 7.3 WASM runtime details
+
+- Runtime: `wasmtime-go` (mature, sandboxed, cross-platform, single static dependency).


[MEDIUM] Documentation: wasmtime-go is described as a single static dependency, but it is a CGO binding to the Wasmtime C API. That tradeoff affects static builds and cross-compilation for a Go CLI, so leaving it out makes the Phase 3 runtime choice look lower-risk than it is. Either acknowledge the CGO/release-artifact impact or compare wazero as the pure-Go alternative.

spboyer · 2026-06-19T18:11:57Z

+
+### Phase 2 — Git backend + auth + cache (4–6 weeks)
+
+- Add `git` backend resolving `github.com/owner/repo@version` via `git archive` (no full clone) for tags, fallback to `git ls-remote` + sparse fetch for branches/SHAs.


[MEDIUM] Documentation: The Git backend plan says GitHub refs resolve via git archive, but GitHub does not support the remote git archive protocol. Implementing this literally will fail for the primary github.com/owner/repo@version case. Use GitHub archive/tarball downloads for GitHub, or make shallow/sparse fetch the primary path.
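
For the GitHub case, the tarball route could start from the REST API's repository archive endpoint, `GET /repos/{owner}/{repo}/tarball/{ref}`, which answers with a redirect to the archive. The sketch below only builds that URL; `GitHubTarballURL` is a hypothetical helper, and authentication, redirect handling, and extraction are out of scope. How refs containing slashes should be encoded is not verified here.

```go
package main

import (
	"fmt"
	"net/url"
)

// GitHubTarballURL builds the REST API archive URL for a repository at
// a given tag, branch, or commit SHA.
func GitHubTarballURL(owner, repo, ref string) string {
	return "https://api.github.com/repos/" +
		url.PathEscape(owner) + "/" +
		url.PathEscape(repo) + "/tarball/" +
		url.PathEscape(ref)
}

func main() {
	fmt.Println(GitHubTarballURL("microsoft", "waza", "v1.2.0"))
}
```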

spboyer · 2026-06-19T18:11:57Z

+| Embedded scripting (Lua/Starlark/JS) | ✅ | ⚠️ Sandbox quality varies | Slower than native | Easy (text artifact) | ❌ Rejected (yet-another-language; WASM gives us all of these via guest toolchains) |
+| In-proc Python | ✅ Where python exists | ❌ | Slow startup | Hard (env hell) | ❌ Rejected (replicates OpenAI Evals' main pain point) |
+
+Two formats, one protocol, one runtime, no language lock-in.


[LOW] Documentation: "one runtime" contradicts the design in §7.2/§7.3, which has a WASM runtime plus an external program runner sharing WGP/1. That makes the summary misleading when readers compare it to the detailed architecture. Change it to "two formats, one protocol, two runtimes, no language lock-in."

Copilot AI review requested due to automatic review settings June 19, 2026 13:11

Copilot started reviewing on behalf of spboyer June 19, 2026 13:12

Copilot AI reviewed Jun 19, 2026


spboyer and others added 2 commits June 19, 2026 18:58

Merge branch 'main' into spboyer/issue-13-eval-registry-design

d34c027

Potential fix for pull request finding

0bb6a0b

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 19, 2026 17:59

spboyer and others added 2 commits June 19, 2026 18:59

Potential fix for pull request finding

d326f69

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

b0bb87c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot started reviewing on behalf of spboyer June 19, 2026 17:59

Copilot AI reviewed Jun 19, 2026


spboyer wants to merge 5 commits into main from spboyer/issue-13-eval-registry-design


3 participants
