Skip to content

feat(harness): add Harbor (Terminus 2) agent harness#220

Merged
Perry2004 merged 5 commits into
mainfrom
feat/harbor-harness
Jun 5, 2026
Merged

feat(harness): add Harbor (Terminus 2) agent harness#220
Perry2004 merged 5 commits into
mainfrom
feat/harbor-harness

Conversation

@reacher-z

Copy link
Copy Markdown
Collaborator

Add Harbor (Terminus 2) as a ClawBench agent harness

Implements the harness side of #218.

What Harbor is

Harbor (Apache-2.0, Python, LiteLLM-based) is the Terminal-Bench team's agent-evaluation framework. Its first-party agent is Terminus 2, a terminal/tmux agent — Harbor ships no native browser/CDP support.

How this harness bridges it (honest, not forced)

  • Runs the real Terminus 2 agent (real Harbor LiteLLM backend + tmux loop) against a minimal LocalEnvironment(BaseEnvironment) whose exec() runs locally — the ClawBench container is the sandbox (no nested container).
  • Gives Terminus the agent-browser CDP CLI (the same tool the hermes harness uses) via a thin ab wrapper pinned to --cdp 9222, plus a system-prompt prelude. Browser actions flow CDP → ClawBench recorder extension → /data/actions.jsonl; the eval interceptor's /data/.stop-requested still applies.

Changes

  • + src/clawbench/runtime/harnesses/harbor/{Dockerfile.harbor, setup-harbor.sh, run-harbor.sh, harbor_driver.py}
  • config.py: register "harbor" in HARNESSES
  • docker.py: add dockerfile map entry
  • Harbor needs Python ≥3.12 (base image is 3.11) → installed into a uv-managed 3.12 venv (/opt/harbor-venv), base interpreter untouched.
  • api_type mapping mirrors pi (gemini/, openai/, anthropic/, openrouter/ + api_base). Gemini gotcha handled: …/v1beta/openai base_url maps to LiteLLM native gemini/<model> with GEMINI_API_KEY and no api_base (avoids the /chat/completions 404).

Smoke test

Built clawbench-harbor; ran one task ("open example.com, report heading + link text") with openai/gpt-oss-120b:freePASS (Terminus issued ab open/ab snapshot/ab get text, extracted the text, actions.jsonl recorded pageLoad). One container only; the running batch was untouched.

Caveats / follow-ups

  • Terminus drives the browser via shell commands, so agent-messages.jsonl is shell-shaped (not browser-tool-shaped); actions.jsonl is identical to other harnesses (from the recorder).
  • Free models were flaky in testing — use a paid/stronger model for real evals.
  • agent-browser@0.26.0 and harbor==0.13.1 are pinned.

reacher-z added 2 commits June 4, 2026 04:30
Add the 'harbor' harness driving Harbor's first-party Terminus 2 terminal
agent. Harbor (Apache-2.0) is a LiteLLM-based agent-evaluation framework; its
only built-in LLM agent, Terminus 2, is a tmux/terminal agent with no native
browser/CDP support. The harness bridges this by:

- Running the real Terminus 2 agent against a minimal LocalEnvironment whose
  exec() runs in the already-running ClawBench container (no nested sandbox).
- Exposing the agent-browser CDP CLI (as used by the hermes harness) in
  Terminus' shell, pointed at the shared Chrome at 127.0.0.1:9222, with a
  system-prompt prelude teaching the agent to drive it. Actions flow through
  CDP -> recorder extension -> /data/actions.jsonl; the eval interceptor's
  /data/.stop-requested still applies.
- Mapping api_types to LiteLLM models like the pi harness (gemini/openai/
  anthropic/openrouter + api_base), with native Gemini routing via GEMINI_API_KEY.
- Installing Harbor into a uv-managed Python 3.12 venv since the base image is
  3.11 and Harbor requires >=3.12.
- Promoting Terminus' trajectory.json into /data/agent-messages.jsonl.

Register 'harbor' in HARNESSES and the docker dockerfile map.
…ect conflict

Smoke testing on a live container surfaced two issues:

- agent-browser rejects --cdp and --auto-connect when set together; the `ab`
  wrapper already pins --cdp 9222, so drop AGENT_BROWSER_AUTO_CONNECT from the
  agent's extra_env. Before this fix the model wasted several turns fighting the
  conflict before falling back to bare agent-browser.
- The first `ab` command paid the agent-browser daemon cold-start cost, which a
  weak model could misread as a failure. Warm up the daemon (ab open about:blank)
  in run-harbor.sh after CDP is ready, failing fast with agent_browser_cdp_failed
  if the bridge cannot attach.

Also tighten the system-prompt prelude: tell the agent ab is already connected
(no --cdp), use 2-3s load durations, and not to mark the task complete before
verifying with a snapshot.

Verified end-to-end: Terminus 2 (openrouter/openai/gpt-oss-120b:free via the
LiteLLM backend) ran 'ab open' / 'ab snapshot -i' / 'ab get text' on the first
try, the pageLoad was recorded to /data/actions.jsonl through the ClawBench
recorder extension, and the transcript was promoted to agent-messages.jsonl.
reacher-z and others added 3 commits June 4, 2026 07:10
…ess registry

Resolves PR #220 conflicts after main migrated harness registration to
HARNESS_REGISTRY/harnesses.yaml. Registers harbor in harnesses.yaml,
adds harbor/usage-emitter.py, wires it in run-harbor.sh, and updates the
registry contract test (EXPECTED_HARNESSES/EXTRA_FILES/AGENT_MESSAGE_SOURCES).

@Perry2004 Perry2004 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM🫡

@Perry2004 Perry2004 merged commit ea5a901 into main Jun 5, 2026
8 checks passed
@Perry2004 Perry2004 deleted the feat/harbor-harness branch June 5, 2026 02:08
@github-project-automation github-project-automation Bot moved this from Todo to Done in ClawBench Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add full support for the Harbor agent framework (harbor-framework/harbor)

2 participants