docs: Eval & Grader Registry design doc (#13)#337

Draft
spboyer wants to merge 5 commits into main from spboyer/issue-13-eval-registry-design

@spboyer spboyer commented Jun 19, 2026

Member

Closes #13. Refs #15, #17, #18.

Design-only doc for a shared eval & grader registry — waza's #1 competitive gap vs. OpenAI Evals. No code changes.

What's inside

docs/design/13-eval-registry.md covers:

  • feat: Go-module-style grader/eval references #15 — Go-module-style refs: host/path@version#subpath syntax, SemVer + eval.lock.yaml for reproducibility, content-addressed cache, gh auth token / env-var auth, flat transitive-dep resolution.
  • feat: Composable eval construction from registry graders #17 — Composable construction: waza registry search/add/get/sync/list CLI, federated index file, deep-merge override rules, waza init --grader scaffolding.
  • feat: Grader plugin extensibility (WASM/external programs) #18 — Plugin extensibility: WGP/1 (Waza Grader Protocol) over two runtimes — WASM (sandboxed via wasmtime, preferred for registry-distributed graders) and program (formalized bring-your-own-binary). Go plugins and embedded scripting rejected with rationale.
  • Security model (digest pinning, sandbox limits, no-secret lockfile), backward-compat impact (additive ref: field on GraderConfig, schema update, results.json source field), and a 5-phase rollout where each phase is independently shippable:
    1. Spec & schema
    2. Local-only resolver + program runtime (validates the contract without picking a backend)
    3. Git backend + auth + cache
    4. WASM runtime + sandbox
    5. Hardening (sigstore, vet, OCI backend)
  • Decision matrix (D1–D10), open questions, rejected alternatives, example end-state eval.yaml and eval.lock.yaml.

Path note

Issue body specified docs/research/waza-eval-registry-design.md, but docs/research/ is gitignored (.gitignore line 116, "Internal research docs"). The validation comment on #13 explicitly asked which location is canonical (docs/design/ vs docs/plans/). I placed the doc at docs/design/13-eval-registry.md to match existing convention (135-improve-concurrency.md, 194-baseline-skill-impact.md). Happy to move if the team prefers docs/plans/.

Out of scope (intentional)

Review asks

  • Sanity-check the WASM-vs-program split (D9, §7.1). The rationale for rejecting Go plugins and embedded scripting is in §13.
  • Confirm the lockfile + --frozen for CI approach matches how teams want to consume registry graders.
  • Confirm the auth model (D4): gh auth token for GitHub by default, env-var overrides per host, never store secrets in ~/.waza/credentials.yaml.
  • Decide canonical doc location (docs/design/ is my recommendation).

Adds docs/design/13-eval-registry.md covering the design for a
shared eval and grader registry. Design-only; no implementation.

Note: issue #13 asked for docs/research/, but that path is
gitignored. Placed in docs/design/ to match existing convention
(135-improve-concurrency.md, 194-baseline-skill-impact.md) and
to answer the open question from the issue validation comment.

Decisions cover sub-issues:
- #15 Go-module-style refs: ref syntax, SemVer + lockfile,
  content-addressed cache, gh/env auth, flat transitive deps.
- #17 Composable eval construction: registry search/add/get/sync,
  deep-merge override rules, waza init --grader scaffolding.
- #18 Plugin extensibility: WGP/1 protocol over WASM (sandboxed)
  and program (bring-your-own-binary); Go plugins and embedded
  scripting rejected with rationale.

Includes security model, backward-compat impact, a 5-phase
rollout (spec, local resolver, git backend, WASM runtime,
hardening), open questions, rejected alternatives, and example
end-state YAML + lockfile. Backend selection (#16) deferred to
start of Phase 2.

Refs #13 #15 #17 #18
Copilot AI review requested due to automatic review settings June 19, 2026 13:11

Copilot AI left a comment

Contributor

Pull request overview

Adds a design document proposing a shared eval & grader registry for waza, covering reference syntax/lockfiles, registry discovery & composition UX, and an extensibility model (WASM + external program protocol) intended to close the “shared registry” competitive gap vs. OpenAI Evals.

Changes:

  • Introduces a full design doc for registry refs (host/path@version#subpath), caching, lockfiles, auth, and transitive deps (#15).
  • Specifies CLI UX for discovery/composition (waza registry search/add/get/sync/list) and deep-merge override rules (#17).
  • Proposes a plugin model and security posture (WGP/1 + WASM sandbox) with a phased rollout plan (#18).

| File | Description |
|------|-------------|
| docs/design/13-eval-registry.md | New design doc describing the eval/grader registry architecture, UX, security model, and rollout phases. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 4

Comment thread docs/design/13-eval-registry.md Outdated
Comment thread docs/design/13-eval-registry.md Outdated
ref = host "/" path [ "@" version ] [ "#" subpath ]
host = DNS-1123 hostname (e.g. github.com, gitlab.example.com)
path = 1*( segment "/" ) segment
version = semver | "latest" | "main" | git-sha-7..40
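
A minimal sketch of how a resolver might split a reference along this grammar. The `Ref` type and `ParseRef` function are hypothetical names, and the repo and subpath in the example ref are placeholders. The sketch checks shape only; it does not validate the hostname, SemVer syntax, or SHA length.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Ref is a hypothetical parsed form of host/path[@version][#subpath].
type Ref struct {
	Host    string
	Path    string
	Version string // empty when the ref has no "@version"
	Subpath string // empty when the ref has no "#subpath"
}

// ParseRef splits a reference per the grammar. It checks shape only.
func ParseRef(s string) (Ref, error) {
	var r Ref
	// The subpath follows the first '#'.
	if i := strings.IndexByte(s, '#'); i >= 0 {
		s, r.Subpath = s[:i], s[i+1:]
		if r.Subpath == "" {
			return Ref{}, errors.New("empty subpath after '#'")
		}
	}
	// The version follows the last '@' in what remains.
	if i := strings.LastIndexByte(s, '@'); i >= 0 {
		s, r.Version = s[:i], s[i+1:]
		if r.Version == "" {
			return Ref{}, errors.New("empty version after '@'")
		}
	}
	host, path, ok := strings.Cut(s, "/")
	if !ok || host == "" {
		return Ref{}, errors.New("ref must start with host/")
	}
	// path = 1*( segment "/" ) segment: at least two non-empty segments.
	segs := strings.Split(path, "/")
	if len(segs) < 2 {
		return Ref{}, errors.New("path needs at least two segments")
	}
	for _, seg := range segs {
		if seg == "" {
			return Ref{}, errors.New("empty path segment")
		}
	}
	r.Host, r.Path = host, path
	return r, nil
}

func main() {
	r, err := ParseRef("github.com/waza-evals/graders@v1.2.0#text/contains")
	fmt.Printf("%+v %v\n", r, err)
}
```

Splitting on `#` before `@` keeps an `@` inside a subpath from being read as a version separator; whether subpaths may contain `@` at all is not stated in the grammar.
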
Comment thread docs/design/13-eval-registry.md Outdated
spboyer and others added 2 commits June 19, 2026 18:58
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 17:59
spboyer and others added 2 commits June 19, 2026 18:59
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 2

**Status:** Draft for review
**Date:** 2026-06-19

> **Note on location:** Issue #13 originally specified `docs/research/waza-eval-registry-design.md`, but that directory is gitignored (`.gitignore` line 116, "Internal research docs"). This doc lives under `docs/design/` to match the established convention (`docs/design/135-improve-concurrency.md`, `docs/design/194-baseline-skill-impact.md`) and to ensure it ships in the repo. Validation comment on #13 explicitly asked which location is canonical; this PR proposes `docs/design/` as the answer.
## 10. Compatibility considerations

- **Backward compatible.** Existing `eval.yaml` files keep working: the `ref:` field is new and additive. `GraderConfig.UnmarshalYAML` already uses `KnownFields(true)`, so we'll need to add `ref` to the raw struct; no other change is required for existing graders.
- **JSON schema.** `schemas/eval.schema.json` (referenced by `site/src/content/docs/reference/schema.mdx`) needs a new union: either the existing typed grader or `{ ref: string, name?, weight?, model?, config? }`. We can express this as `oneOf` keyed on the presence of `ref`.
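
The `oneOf` rule described above can also be mirrored as a runtime check. This is a rough sketch, not the actual `GraderConfig`: the `GraderEntry` struct is hypothetical, and the `Type` field is an assumed discriminator for existing typed graders.

```go
package main

import (
	"errors"
	"fmt"
)

// GraderEntry is a hypothetical, simplified view of one grader in eval.yaml.
type GraderEntry struct {
	Ref  string // registry reference (the new, additive field)
	Type string // assumed discriminator for existing typed graders
	Name string
}

// Validate enforces "exactly one of ref or type", the same shape a
// oneOf keyed on the presence of ref would express in the JSON schema.
func (g GraderEntry) Validate() error {
	switch {
	case g.Ref != "" && g.Type != "":
		return errors.New("grader sets both ref and type")
	case g.Ref == "" && g.Type == "":
		return errors.New("grader sets neither ref nor type")
	}
	return nil
}

func main() {
	fmt.Println(GraderEntry{Ref: "github.com/waza-evals/graders@v1.2.0"}.Validate())
	fmt.Println(GraderEntry{}.Validate())
}
```
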

@spboyer spboyer left a comment

Member Author

Adds a registry design doc; the structure is solid, but several spec details would mislead or weaken implementation.

Issues to address:

  • docs/design/13-eval-registry.md:119 - subpath grammar lacks explicit traversal and symlink-escape rejection
  • docs/design/13-eval-registry.md:441 - compromised-index mitigation overstates first-use integrity guarantees
  • docs/design/13-eval-registry.md:158 - cache path is Linux/XDG-only instead of OS cache-dir based
  • docs/design/13-eval-registry.md:176 - GitLab credential command does not return a token
  • docs/design/13-eval-registry.md:348 - wasmtime-go static dependency claim misses CGO/cross-compile tradeoffs
  • docs/design/13-eval-registry.md:477 - GitHub refs rely on remote git archive, which GitHub does not support
  • docs/design/13-eval-registry.md:309 - summary says one runtime but design uses WASM plus program runtimes

host = DNS-1123 hostname (e.g. github.com, gitlab.example.com)
path = 1*( segment "/" ) segment
version = semver | "latest" | "main" | git-sha-7..40
subpath = POSIX path relative to repo root

[HIGH] Security: #subpath is only defined as a POSIX path relative to the repo root, but the design never requires normalization or rejection of .., absolute paths, or symlink escapes. Resolver implementations could extract or read files outside the artifact/cache root when resolving subdirectory refs. Specify that subpaths are cleaned and must remain under the resolved artifact root after symlink evaluation.

| Private repo token leakage | Tokens read from env or `gh auth`; never written to lockfile or results.json; redacted in `--verbose` logs. |
| Cache poisoning | Content-addressed cache (sha256); on read, digest is re-verified before use. |
| Typosquatting in federated index | Hosts are explicit (`github.com/waza-evals/...`); no short names; `waza registry add` shows the full ref and (post-v1) signature info before writing the lockfile. |
| Compromised index URL | Each entry's `digest`/`source` is verified at fetch time, so a hostile index can only deny service or surface known-good artifacts — it cannot substitute content. |

[HIGH] Security: The compromised-index row overstates what digest verification can guarantee on first use. If a hostile index supplies both source and digest, fetching bytes and verifying them against that digest only proves the artifact matches the hostile metadata, not that it is trusted. Clarify that lockfiles protect subsequent resolutions and that first-use substitution requires trusted metadata/signatures or explicit user trust.

### 5.3 Cache layout

```
$XDG_CACHE_HOME/waza/registry/
```

[MEDIUM] Cross-platform portability: The cache layout is specified as $XDG_CACHE_HOME/..., which is Linux-specific. Waza is a cross-platform Go CLI, so literal XDG semantics will either diverge on macOS/Windows or place cache data in non-idiomatic locations. Define the root as Go's os.UserCacheDir() with per-OS examples instead.

| Host pattern | Default credential source | Override |
|--------------|---------------------------|----------|
| `github.com/*` | `gh auth token` (if installed); else `GITHUB_TOKEN` env | `WAZA_REGISTRY_TOKEN_GITHUB_COM` |
| `gitlab.*` | `glab auth status` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` |

[MEDIUM] Documentation: glab auth status verifies login state but does not return a token, so the credential source described here is not implementable as written. This would break GitLab private registry auth if copied directly into Phase 2. Use glab auth token or glab auth status -t for parity with the GitHub row.

Suggested change
- | `gitlab.*` | `glab auth status` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` |
+ | `gitlab.*` | `glab auth token` (if installed); else `GITLAB_TOKEN` | `WAZA_REGISTRY_TOKEN_<HOST_UPPER_DOTS_TO_UNDERSCORE>` |


### 7.3 WASM runtime details

- Runtime: `wasmtime-go` (mature, sandboxed, cross-platform, single static dependency).

[MEDIUM] Documentation: wasmtime-go is described as a single static dependency, but it is a CGO binding to the Wasmtime C API. That tradeoff affects static builds and cross-compilation for a Go CLI, so leaving it out makes the Phase 3 runtime choice look lower-risk than it is. Either acknowledge the CGO/release-artifact impact or compare wazero as the pure-Go alternative.


### Phase 2 — Git backend + auth + cache (4–6 weeks)

- Add `git` backend resolving `github.com/owner/repo@version` via `git archive` (no full clone) for tags, fallback to `git ls-remote` + sparse fetch for branches/SHAs.

[MEDIUM] Documentation: The Git backend plan says GitHub refs resolve via git archive, but GitHub does not support the remote git archive protocol. Implementing this literally will fail for the primary github.com/owner/repo@version case. Use GitHub archive/tarball downloads for GitHub, or make shallow/sparse fetch the primary path.
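
As a sketch of the tarball route for GitHub, the request below targets the REST API's repository tarball endpoint. `NewTarballRequest` is a hypothetical helper; it only builds the request, and it assumes `ref` is a simple tag or SHA. The endpoint answers with a redirect to the archive, which Go's default HTTP client follows.

```go
package main

import (
	"fmt"
	"net/http"
)

// NewTarballRequest builds a GET request for a repository archive at a
// given tag or SHA. The token is optional for public repositories.
func NewTarballRequest(owner, repo, ref, token string) (*http.Request, error) {
	u := fmt.Sprintf("https://api.github.com/repos/%s/%s/tarball/%s", owner, repo, ref)
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	// Owner, repo, and tag are placeholders for illustration.
	req, err := NewTarballRequest("waza-evals", "graders", "v1.2.0", "")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```
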

| Embedded scripting (Lua/Starlark/JS) | ✅ | ⚠️ Sandbox quality varies | Slower than native | Easy (text artifact) | ❌ Rejected (yet-another-language; WASM gives us all of these via guest toolchains) |
| In-proc Python | ✅ Where python exists | ❌ | Slow startup | Hard (env hell) | ❌ Rejected (replicates OpenAI Evals' main pain point) |

Two formats, one protocol, one runtime, no language lock-in.

[LOW] Documentation: "one runtime" contradicts the design in §7.2/§7.3, which has a WASM runtime plus an external program runner sharing WGP/1. That makes the summary misleading when readers compare it to the detailed architecture. Change it to "two formats, one protocol, two runtimes, no language lock-in."

Development

Successfully merging this pull request may close these issues.

feat: Eval & Grader Registry — design doc

3 participants