From 2e0728ca1343d9dab6b2104c4b34719b9550e210 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Wed, 17 Jun 2026 11:15:31 +0200
Subject: [PATCH] docs(workspace): document Harbor execution boundary

---
 .../docs/docs/guides/benchmark-provenance.mdx |  56 +++++++
 .../docs/guides/workspace-architecture.mdx    |  19 +++
 docs/adr/2026-06-17-harbor-runner-boundary.md | 153 ++++++++++++++++++
 3 files changed, 228 insertions(+)
 create mode 100644 docs/adr/2026-06-17-harbor-runner-boundary.md

diff --git a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
index 976a4f37b..cd0f8936a 100644
--- a/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
+++ b/apps/web/src/content/docs/docs/guides/benchmark-provenance.mdx
@@ -141,6 +141,62 @@ acquisition remains outside the eval; use registered projects or
 `git_cache.mirrors` when a local machine needs faster large-repo setup. See
 [Workspace Architecture](/docs/guides/workspace-architecture/#repo-provenance-vs-acquisition).
 
+## Native AgentV vs Harbor-backed Benchmarks
+
+Use native AgentV workspaces for repo-backed evals where AgentV should own the
+run lifecycle: materialize generic repos, run targets, execute hooks and graders,
+gate CI, and write AgentV result bundles. This fits custom internal suites,
+target comparisons, narrow regression suites, and CI checks built from AgentV
+primitives.
+
+```yaml
+name: repo-regressions
+
+workspace:
+  isolation: per_test
+  repos:
+    - path: ./repo
+      repo: https://github.com/example/widget.git
+      commit: 4f3e2d19b6e4e8f1c2b7d9a0e5a6b7c8d9e0f123
+  hooks:
+    before_each:
+      command: ["python", "./scripts/apply-case-fixtures.py"]
+
+execution:
+  targets: [codex, claude]
+
+assertions:
+  - name: tests-pass
+    type: code-grader
+    command: ["python", "./graders/run-tests.py"]
+    required: true
+```
+
+Use a Harbor-backed runner for standard benchmark suites Harbor owns, such as
+SWE-Bench Verified, Multi-SWE-Bench, Terminal-Bench, or suites with Harbor-owned
+Docker and Compose adapters. In that path AgentV should stay at the
+orchestration boundary: launch or import the Harbor job, apply AgentV gates to
+the imported results, and link Opik traces when Harbor uploads them.
+
+```yaml
+# Proposed runner boundary, not a current AgentV task schema.
+name: swebench-verified-codex
+
+execution:
+  runner: harbor
+  harbor:
+    dataset: swebench-verified
+    agent: codex
+    model: openai/gpt-5-mini
+    opik:
+      enabled: true
+```
+
+Do not translate Harbor `task.toml`, verifier packaging, or suite-specific
+Docker/Compose adapter fields into AgentV core eval schema. If the benchmark's
+runtime contract is already owned by Harbor, keep those details in Harbor and
+let AgentV consume the job metadata, rewards, artifacts, and trace links.
+
 ## Finance-Style Generated Dataset
 
 Generated datasets often need stable row provenance more than workspace setup.
diff --git a/apps/web/src/content/docs/docs/guides/workspace-architecture.mdx b/apps/web/src/content/docs/docs/guides/workspace-architecture.mdx
index 4a180200b..dd047dc4c 100644
--- a/apps/web/src/content/docs/docs/guides/workspace-architecture.mdx
+++ b/apps/web/src/content/docs/docs/guides/workspace-architecture.mdx
@@ -84,11 +84,30 @@ Supported repo fields:
 | `sparse` | Optional sparse-checkout paths |
 | `ancestor` | Walk N parents back after resolving `commit` / `base_commit` |
 
+`commit` is the canonical AgentV checkout pin. `base_commit` exists only as a
+SWE-Bench-friendly alias for the same value; when both fields are present they
+must match. Prefer `commit` in new AgentV-authored evals unless preserving an
+upstream dataset column name makes the eval easier to audit.
+
 `source`, `type`, `checkout`, `checkout.resolve`, and `clone` are not part of
 the repo schema. Acquisition settings are deliberately outside eval YAML so the
 same benchmark can run against the same repository identity on every machine
 while each harness uses the fastest safe local source available.
 
+## Native workspace boundary
+
+Use native AgentV workspaces when AgentV owns the run lifecycle: custom internal
+suites, CI gates, target comparisons, pooled workspaces, local setup hooks,
+Docker workspaces, and generic repository acquisition. In that path,
+`workspace.repos` declares the repos and checkout pins while AgentV materializes
+the workspace, runs targets and graders, and writes AgentV run bundles.
+
+Use a Harbor-backed runner boundary for standard benchmark suites whose
+acquisition, packaging, verifier layout, Docker or Compose adapters, and trace
+export are already owned by Harbor. In that path, AgentV should launch, import,
+and gate Harbor jobs and link Opik traces. It should not copy Harbor `task.toml`
+or suite-specific adapter fields into AgentV core workspace schema.
+
 ## Acquisition resolver
 
 AgentV normalizes `repo` identity before acquisition. For example,
diff --git a/docs/adr/2026-06-17-harbor-runner-boundary.md b/docs/adr/2026-06-17-harbor-runner-boundary.md
new file mode 100644
index 000000000..4ca7749c8
--- /dev/null
+++ b/docs/adr/2026-06-17-harbor-runner-boundary.md
@@ -0,0 +1,153 @@
+# ADR: Keep Harbor benchmark execution behind a runner boundary
+
+Date: 2026-06-17
+
+Status: Proposed
+
+## Context
+
+AgentV now has native workspace repository acquisition for custom evals, CI
+gates, target comparisons, pooled workspaces, hooks, and Docker workspace cases.
+That should remain generic infrastructure: `workspace.repos[].commit` is the
+canonical checkout pin, and `workspace.repos[].base_commit` is only a
+SWE-Bench-friendly alias for the same value.
+
+Harbor owns benchmark-grade execution for standard suites such as SWE-Bench
+Verified, Multi-SWE-Bench, Terminal-Bench, and suites with Harbor-specific
+Docker or Compose adapters. Those suites carry runtime contracts that should not
+be copied into AgentV core: task packaging, verifier images, adapter flags,
+result artifacts, and Opik upload behavior.
+
+## Decision
+
+AgentV should support Harbor-backed benchmark execution as a runner boundary,
+not as new AgentV task schema.
+
+AgentV core should own:
+
+- native workspaces and generic repo acquisition;
+- AgentV run bundle writing and result import;
+- CI gates and comparisons over imported metrics;
+- links from AgentV results to Harbor jobs and Opik traces.
+
+Harbor should own:
+
+- benchmark dataset acquisition and task packaging;
+- suite-specific runtime adapters and verifier images;
+- Harbor `task.toml` files and Harbor YAML config;
+- Opik trace upload through Harbor when enabled.
+
+## Minimal future config surface
+
+The AgentV eval file should select Harbor with a nested runner config:
+
+```yaml
+name: swebench-verified-codex
+
+execution:
+  runner: harbor
+  harbor:
+    dataset: swebench-verified
+    agent: codex
+    model: openai/gpt-5-mini
+    opik:
+      enabled: true
+```
+
+For a Harbor-authored YAML file, use `config` instead of `dataset`:
+
+```yaml
+execution:
+  runner: harbor
+  harbor:
+    config: ./harbor/swebench-verified.yaml
+```
+
+The first implementation should accept exactly one Harbor source selector:
+`dataset` for a known Harbor dataset id, or `config` for an existing Harbor YAML
+file. There should be no precedence rule between them. If both are set, fail
+validation and ask the user to choose one.
+
+Keep Harbor-specific options nested under `execution.harbor`. Do not add
+top-level AgentV fields for Harbor task packaging, verifier images, task patches,
+or Docker/Compose adapter settings. If a Harbor option becomes too specific to
+standardize, users should put it in the referenced Harbor YAML file instead of
+AgentV adding a pass-through field.
+
+## CLI invocation strategy
+
+Native evals continue to run with the existing command:
+
+```bash
+agentv eval evals/native.eval.yaml --target codex
+```
+
+Harbor-backed evals should use the same top-level entrypoint and dispatch based
+on `execution.runner`:
+
+```bash
+agentv eval evals/swebench-harbor.eval.yaml
+```
+
+The Harbor runner should:
+
+1. validate the nested Harbor config;
+2. launch Harbor through its CLI or API;
+3. record the Harbor job id in the AgentV run metadata;
+4. wait for completion unless a future async mode is explicitly added;
+5. import Harbor outputs into an AgentV run bundle;
+6. evaluate AgentV gates against the imported results.
+
+Importing an already-completed Harbor job can be a separate follow-up command:
+
+```bash
+agentv results import harbor --job
+```
+
+Do not overload native `--target` semantics in the first Harbor runner slice.
+Harbor `agent`, `model`, and matrix behavior should come from
+`execution.harbor` or the referenced Harbor YAML until repeated usage proves a
+shared AgentV flag is needed.
+
+## Unsupported fields and non-goals
+
+The Harbor runner mode should not add or interpret:
+
+- Harbor `task.toml` as AgentV eval schema;
+- `workspace.repos` rows for Harbor-owned suite task acquisition;
+- `workspace.docker` verifier image fields for Harbor-owned suite execution;
+- per-case SWE-Bench fields such as `test_patch`, `fail_to_pass_tests`, or
+  `base_commit` as Harbor runner inputs;
+- generic `extra_args` or arbitrary pass-through maps in the initial AgentV
+  surface.
+
+These fields remain valid in native AgentV evals when authors compose their own
+workspace, hooks, and graders. They are non-goals only for the Harbor-backed
+standard-suite path.
+
+## Implementation sequencing
+
+1. Document the native-vs-Harbor boundary and commit alias rules.
+2. Add schema validation for optional `execution.runner` and
+   `execution.harbor`, with no changes to native workspace acquisition.
+3. Add a Harbor launch adapter that records job identity and status.
+4. Add a Harbor result importer that maps rewards, exceptions, timings,
+   artifacts, and Opik trace URLs into AgentV run bundles.
+5. Apply existing AgentV gates and comparison primitives to imported Harbor
+   results.
+6. Add optional async job polling only after the synchronous path is proven.
+
+## Consequences
+
+Positive:
+
+- Keeps AgentV core lightweight and generic.
+- Preserves native workspace acquisition for custom and CI-oriented evals.
+- Lets Harbor evolve benchmark adapters without AgentV schema churn.
+- Gives AgentV a clear place to gate, compare, and display Harbor results.
+
+Negative:
+
+- Standard-suite users need Harbor installed or reachable.
+- Harbor runner implementation cannot reuse all native `workspace` features.
+- Some future CLI ergonomics may require a second decision after real usage.
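
Review note, not part of the patch: the selector rule in "Minimal future config surface" (exactly one of `dataset` or `config`, no precedence, fail if both are set) and the dispatch on `execution.runner` could be sketched roughly as below. This is only an illustration under assumptions. The function names `select_runner` and `validate_harbor_source`, the default runner name `"native"`, and the error wording are hypothetical and do not come from AgentV or Harbor. The ADR says to fail when both selectors are set; rejecting the case where neither is set is an additional assumption read from "exactly one".

```python
from typing import Any, Mapping


def select_runner(execution: Mapping[str, Any]) -> str:
    """Return the runner named in `execution.runner`.

    Falls back to "native" (a hypothetical default name) when the field is absent.
    """
    return execution.get("runner", "native")


def validate_harbor_source(execution: Mapping[str, Any]) -> str:
    """Return which source selector is set under `execution.harbor`.

    Raises ValueError when both `dataset` and `config` are set, or when
    neither is set. There is deliberately no precedence between the two.
    """
    harbor = execution.get("harbor") or {}
    selectors = [key for key in ("dataset", "config") if harbor.get(key)]
    if len(selectors) == 2:
        raise ValueError(
            "execution.harbor: set either `dataset` or `config`, not both"
        )
    if not selectors:
        raise ValueError(
            "execution.harbor: one of `dataset` or `config` is required"
        )
    return selectors[0]


if __name__ == "__main__":
    # Mirrors the first YAML example in the ADR, already parsed into a dict.
    execution = {
        "runner": "harbor",
        "harbor": {"dataset": "swebench-verified", "agent": "codex"},
    }
    if select_runner(execution) == "harbor":
        print(validate_harbor_source(execution))
```

Raising an error instead of silently preferring one selector keeps the "no precedence rule" decision visible to the eval author at validation time, before any Harbor job is launched.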