feat(harness): add Harbor (Terminus 2) agent harness #220
Merged
Add the 'harbor' harness driving Harbor's first-party Terminus 2 terminal agent. Harbor (Apache-2.0) is a LiteLLM-based agent-evaluation framework; its only built-in LLM agent, Terminus 2, is a tmux/terminal agent with no native browser/CDP support.

The harness bridges this by:
- Running the real Terminus 2 agent against a minimal LocalEnvironment whose exec() runs in the already-running ClawBench container (no nested sandbox).
- Exposing the agent-browser CDP CLI (as used by the hermes harness) in Terminus' shell, pointed at the shared Chrome at 127.0.0.1:9222, with a system-prompt prelude teaching the agent to drive it. Actions flow through CDP -> recorder extension -> /data/actions.jsonl; the eval interceptor's /data/.stop-requested still applies.
- Mapping api_types to LiteLLM models like the pi harness (gemini/openai/anthropic/openrouter + api_base), with native Gemini routing via GEMINI_API_KEY.
- Installing Harbor into a uv-managed Python 3.12 venv since the base image is 3.11 and Harbor requires >=3.12.
- Promoting Terminus' trajectory.json into /data/agent-messages.jsonl.

Register 'harbor' in HARNESSES and the docker dockerfile map.
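The api_type mapping described in this commit can be sketched roughly as follows. This is an illustrative sketch, not the code in harbor_driver.py: the names `LiteLLMTarget` and `to_litellm_target` are hypothetical, the key variable names other than GEMINI_API_KEY are assumptions, and treating `api_type == "gemini"` the same as a `/v1beta/openai` base URL is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LiteLLMTarget:
    """Hypothetical container for what the driver would hand to LiteLLM."""
    model: str               # provider-prefixed model string, e.g. "openrouter/<model>"
    api_base: Optional[str]  # None means "use the provider's default endpoint"
    api_key_env: str         # environment variable assumed to hold the API key


# Assumed api_type -> (LiteLLM prefix, key variable) table.
_PROVIDERS = {
    "gemini": ("gemini", "GEMINI_API_KEY"),
    "openai": ("openai", "OPENAI_API_KEY"),
    "anthropic": ("anthropic", "ANTHROPIC_API_KEY"),
    "openrouter": ("openrouter", "OPENROUTER_API_KEY"),
}


def to_litellm_target(api_type: str, model: str,
                      base_url: Optional[str] = None) -> LiteLLMTarget:
    """Map a ClawBench-style (api_type, model, base_url) triple to a LiteLLM target."""
    if api_type not in _PROVIDERS:
        raise ValueError(f"unsupported api_type: {api_type!r}")

    # A base URL ending in /v1beta/openai is routed to native Gemini with no
    # api_base, as the PR describes. Applying the same rule whenever
    # api_type == "gemini" is an assumption of this sketch.
    gemini_compat = bool(base_url) and base_url.rstrip("/").endswith("/v1beta/openai")
    if api_type == "gemini" or gemini_compat:
        return LiteLLMTarget(f"gemini/{model}", None, "GEMINI_API_KEY")

    prefix, key_env = _PROVIDERS[api_type]
    return LiteLLMTarget(f"{prefix}/{model}", base_url, key_env)
```

Under these assumptions, `to_litellm_target("openrouter", "openai/gpt-oss-120b:free")` yields the model string `openrouter/openai/gpt-oss-120b:free`, which matches the model used in the smoke test.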
…ect conflict

Smoke testing on a live container surfaced two issues:
- agent-browser rejects --cdp and --auto-connect when set together; the `ab` wrapper already pins --cdp 9222, so drop AGENT_BROWSER_AUTO_CONNECT from the agent's extra_env. Before this fix the model wasted several turns fighting the conflict before falling back to bare agent-browser.
- The first `ab` command paid the agent-browser daemon cold-start cost, which a weak model could misread as a failure. Warm up the daemon (ab open about:blank) in run-harbor.sh after CDP is ready, failing fast with agent_browser_cdp_failed if the bridge cannot attach.

Also tighten the system-prompt prelude: tell the agent that ab is already connected (no --cdp), to use 2-3s load durations, and not to mark the task complete before verifying with a snapshot.

Verified end-to-end: Terminus 2 (openrouter/openai/gpt-oss-120b:free via the LiteLLM backend) ran 'ab open' / 'ab snapshot -i' / 'ab get text' on the first try, the pageLoad was recorded to /data/actions.jsonl through the ClawBench recorder extension, and the transcript was promoted to agent-messages.jsonl.
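The warm-up step lives in run-harbor.sh; the same fail-fast idea can be sketched in Python for illustration. The function name `warm_up` and the 60-second timeout are assumptions of this sketch, and it only relies on the `ab open about:blank` command and the `agent_browser_cdp_failed` code named above.

```python
import subprocess
import sys
from typing import Sequence

FAILURE_CODE = "agent_browser_cdp_failed"  # error code named in the commit message


def warm_up(cmd: Sequence[str] = ("ab", "open", "about:blank"),
            timeout: float = 60.0) -> bool:
    """Run one throwaway browser command so the daemon cold start is paid up front.

    Returns True if the command exits 0, otherwise logs FAILURE_CODE and returns False.
    """
    try:
        result = subprocess.run(list(cmd), capture_output=True, text=True,
                                timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a hung daemon.
        print(f"{FAILURE_CODE}: {exc}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(f"{FAILURE_CODE}: exit {result.returncode}: {result.stderr.strip()}",
              file=sys.stderr)
        return False
    return True
```

A launcher could gate agent start-up on the result, for example `sys.exit(0 if warm_up() else 1)`, so a broken CDP bridge surfaces before the model spends any turns on it.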
…ess registry

Resolves PR #220 conflicts after main migrated harness registration to HARNESS_REGISTRY/harnesses.yaml. Registers harbor in harnesses.yaml, adds harbor/usage-emitter.py, wires it in run-harbor.sh, and updates the registry contract test (EXPECTED_HARNESSES/EXTRA_FILES/AGENT_MESSAGE_SOURCES).
Add Harbor (Terminus 2) as a ClawBench agent harness
Implements the harness side of #218.
What Harbor is
Harbor (Apache-2.0, Python, LiteLLM-based) is the Terminal-Bench team's agent-evaluation framework. Its first-party agent is Terminus 2, a terminal/tmux agent — Harbor ships no native browser/CDP support.
How this harness bridges it (honest, not forced)
- Runs the real Terminus 2 agent against a minimal `LocalEnvironment(BaseEnvironment)` whose `exec()` runs locally: the ClawBench container is the sandbox (no nested container).
- Exposes the `agent-browser` CDP CLI (the same tool the `hermes` harness uses) via a thin `ab` wrapper pinned to `--cdp 9222`, plus a system-prompt prelude. Browser actions flow CDP → ClawBench recorder extension → `/data/actions.jsonl`; the eval interceptor's `/data/.stop-requested` still applies.

Changes
- Add `src/clawbench/runtime/harnesses/harbor/{Dockerfile.harbor, setup-harbor.sh, run-harbor.sh, harbor_driver.py}`
- `config.py`: register `"harbor"` in `HARNESSES`
- `docker.py`: add dockerfile map entry
- Harbor requires Python >=3.12 and the base image is 3.11, so Harbor is installed into a uv-managed Python 3.12 venv (`/opt/harbor-venv`), base interpreter untouched.
- api_types map to LiteLLM models like `pi` (`gemini/`, `openai/`, `anthropic/`, `openrouter/` + api_base). Gemini gotcha handled: a `…/v1beta/openai` base_url maps to LiteLLM native `gemini/<model>` with `GEMINI_API_KEY` and no api_base (avoids the `/chat/completions` 404).

Smoke test
Built `clawbench-harbor`; ran one task ("open example.com, report heading + link text") with `openai/gpt-oss-120b:free` → PASS (Terminus issued `ab open` / `ab snapshot` / `ab get text`, extracted the text, `actions.jsonl` recorded `pageLoad`). One container only; the running batch was untouched.

Caveats / follow-ups
- `agent-messages.jsonl` is shell-shaped (not browser-tool-shaped); `actions.jsonl` is identical to other harnesses (from the recorder).
- `agent-browser@0.26.0` and `harbor==0.13.1` are pinned.
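For readers wondering what "promoting" the trajectory amounts to, a minimal sketch follows. The trajectory schema is an assumption here (either a top-level list of steps or an object with a `steps` list); the real Terminus 2 `trajectory.json` layout and the harness's actual conversion may differ, and `promote_trajectory` is a hypothetical name.

```python
import json
from pathlib import Path


def promote_trajectory(src: Path, dst: Path) -> int:
    """Rewrite a JSON trajectory as JSON Lines, one step per line.

    Assumes the trajectory is a list of steps, or an object whose "steps"
    key holds that list (hypothetical schema). Returns the number of lines written.
    """
    data = json.loads(src.read_text(encoding="utf-8"))
    steps = data.get("steps", []) if isinstance(data, dict) else data
    with dst.open("w", encoding="utf-8") as out:
        for step in steps:
            # Steps are copied through unchanged, so shell-shaped steps
            # stay shell-shaped in the output.
            out.write(json.dumps(step, ensure_ascii=False) + "\n")
    return len(steps)
```

Because steps are passed through as-is, the output keeps whatever shape the agent produced, which is why the resulting file is shell-shaped rather than browser-tool-shaped.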