HarnessCoder is a local coding agent harness for real repository tasks. The 1.0 story is deliberately narrow:
- event-sourced agent loop
- policy-gated tools
- trace/replay/eval
- context governance: memory, compression, and RepoMap
It is not a fork of CoreCoder, not a smaller LangGraph clone, and not a web UI. The first goal is a controllable runtime that can run an agent loop, gate tool execution with policy, and write every important decision into a replayable JSONL trace.
The core loop is dynamic:
```
state -> model decides action -> policy checks -> tool executes
  -> observation appended -> state updated -> model decides again
```
That shape matters because coding tasks are rarely a fixed DAG. The useful next step depends on the current repo, tool observations, failures, test output, and the model's evolving plan. A DAG or LangGraph-style workflow can be useful for the eval pipeline around the agent, but the agent itself should remain a policy-gated loop.
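To make that loop shape concrete, here is a minimal sketch of a policy-gated loop in plain Python. The names (`model.decide`, `policy`, `append_event`) mirror concepts described in this README but are illustrative, not HarnessCoder's actual API.

```python
import json
import time
import uuid

def run_agent(model, tools, policy, trace_path, max_iterations=8):
    """Illustrative policy-gated loop: decide -> gate -> execute -> observe."""
    state = {"observations": [], "done": False}
    run_id = uuid.uuid4().hex

    def append_event(event_type, payload):
        # Append-only JSONL trace, one event per line.
        with open(trace_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"ts": time.time(), "run_id": run_id,
                                 "event": event_type, **payload}) + "\n")

    append_event("run_started", {})
    for _ in range(max_iterations):
        # e.g. {"tool": "read_file", "args": {"path": "README.md"}}
        action = model.decide(state)
        append_event("model_action", {"action": action})

        if action.get("tool") == "final_answer":
            append_event("run_finished", {"answer": action.get("args")})
            break

        allowed, reason = policy(action)  # minimal gate before every tool call
        append_event("policy_decision", {"allowed": allowed, "reason": reason})
        if not allowed:
            state["observations"].append({"error": f"denied: {reason}"})
            continue

        observation = tools[action["tool"]](**action.get("args", {}))
        append_event("tool_result", {"tool": action["tool"]})
        state["observations"].append(observation)  # state updated, model decides again
    return state
```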
Version 1.0.2 is a runnable local runtime with real bugfix and minimal
greenfield eval loops, HC-Bench-20, trace replay, eval reporting,
model-profile comparison, context-governed prompt assembly, task-local memory,
compression metrics, lightweight RepoMap, checkpoint/resume support, and a
large-output artifact store for audit/replay. It includes:
- A `ScriptedModel` that simulates model actions without calling a real LLM.
- Tool execution for:
  - `read_file(path, offset=0, limit=200)`
  - `search_code(query, path=".")`
  - `repo_map(query=None, max_tokens=1200, refresh=false)`
  - `write_file(path, content, overwrite=false)`
  - `edit_file(path, old, new)`
  - `run_tests(cmd=None, timeout=60)`
  - `run_command(cmd, timeout=30)`
- A minimal policy gate before every tool call.
- JSONL traces under `.harnesscoder/runs/<run_id>/trace.jsonl`.
- Large tool observations previewed in trace/model context and persisted under the run's `artifacts/` directory with size and hash metadata.
- `context_packed`, `checkpoint_created`, `run_resumed`, and `test_result` events for reliability-oriented replay.
- `repo_map_built` and `repo_map_used` events for repository-level context governance.
- Trace replay summaries through `python -m harnesscoder.replay`.
- A minimal eval harness that runs cases, executes tests, scores results, and renders a Markdown report.
- Fixture-backed bugfix evals that copy a repo into `.harnesscoder/eval-workspaces/...` before editing it.
- A greenfield eval that starts from a nearly empty fixture and creates source plus tests from scratch.
- Case-level `allowed_tools`, `step_budget`, and `verifier` fields inspired by benchmark harnesses such as Pico.
- Model profiles and Markdown eval matrices for comparing the same cases across providers.
- HC-Bench-20: 20 fixture-backed cases across bugfix, recovery, greenfield, context-governance, and policy/safety categories.
- A deterministic `hc-bench-oracle` provider that proves the benchmark and report pipeline are solvable before comparing real models.
- CLI entrypoints:

```
python -m harnesscoder "take a look at what this repo does"
python -m harnesscoder --replay .harnesscoder/runs/<run_id>/trace.jsonl
python -m harnesscoder --resume .harnesscoder/runs/<run_id>/checkpoint.json
python -m harnesscoder --eval eval/cases.json
python -m harnesscoder --provider hc-bench-oracle --eval eval/hc_bench_20.json
```

The scripted model currently performs a small repo-orientation pass: search for project mentions, read README.md, list files, and then produce a final answer.
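The observation artifact store mentioned above can be pictured roughly as follows. This is a sketch only: the preview threshold, helper name, and file naming scheme are assumptions layered on the `artifacts/` layout described in this README, not the real implementation.

```python
import hashlib
from pathlib import Path

PREVIEW_CHARS = 2000  # assumed threshold; the real runtime may use a different limit

def store_observation(run_dir: Path, name: str, text: str) -> dict:
    """Keep a short preview for the model/trace; persist the full output with metadata."""
    if len(text) <= PREVIEW_CHARS:
        return {"preview": text, "truncated": False}

    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    artifact_path = run_dir / "artifacts" / f"{name}-{digest[:12]}.txt"
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(text, encoding="utf-8")

    return {
        "preview": text[:PREVIEW_CHARS],
        "truncated": True,
        "artifact": str(artifact_path),
        "size_bytes": len(text.encode("utf-8")),
        "sha256": digest,
    }
```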
HarnessCoder also has a lightweight standard-library terminal UI:

```
python -m harnesscoder --tui
```

Inside the TUI, send a normal message to run the agent and write a new trace. The UI keeps refreshing while a run is active, shows the latest trace event in the status area, and folds the header on narrow or short terminals. Use slash commands for direct tools and runtime controls:

```
/help
/status
/model your-model-name
/model scripted
/provider openai-codex
/base-url https://your-openai-compatible-endpoint.example
/read README.md
/search HarnessCoder
/repo-map HarnessCoder
/edit README.md old new
/test python -m unittest discover -s tests
/run git status --short
/trace latest
```
The current TUI is intentionally small: it is a runnable control surface for the runtime and eval harness, not a full Claude Code clone.
HarnessCoder's context governance has three task-local layers:
- Packed context summarizes hot observations, cold trace history, modified files, and budget.
- Working memory stores task-scoped facts such as failing tests, explored files, relevant symbols, patch summaries, verified facts, and open questions.
- RepoMap builds a bounded repository index from Python AST symbols (imports, classes, functions), plus fallback regex symbols for non-Python text files; a sketch of the AST pass follows this list.
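The sketch below walks Python files with the standard-library `ast` module and collects top-level symbols. The function names, budget handling, and output shape are assumptions for illustration, not the project's actual RepoMap code.

```python
import ast
from pathlib import Path

def python_symbols(path: Path) -> list[str]:
    """Collect top-level imports, classes, and functions from one Python file."""
    tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    symbols = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            symbols.extend(f"import {alias.name}" for alias in node.names)
        elif isinstance(node, ast.ClassDef):
            symbols.append(f"class {node.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(f"def {node.name}")
    return symbols

def build_repo_map(root: Path, max_entries: int = 200) -> dict[str, list[str]]:
    """Bounded index: relative file path -> symbol names, truncated to a budget."""
    repo_map: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*.py")):
        repo_map[str(path.relative_to(root))] = python_symbols(path)
        if len(repo_map) >= max_entries:
            break
    return repo_map
```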
Use prompt modes to ablate these layers:

```
python -m harnesscoder --context-mode none "inspect this repo"
python -m harnesscoder --context-mode pack "inspect this repo"
python -m harnesscoder --context-mode memory "inspect this repo"
```

RepoMap injection is enabled by default for pack and memory modes and can be disabled independently:

```
python -m harnesscoder \
  --context-mode pack \
  --repo-map-mode none \
  "inspect this repo"
```

The MVP includes two optional OpenAI-compatible real-model providers:

- `openai-codex` calls a Responses API endpoint at `/responses`.
- `openai-chat` calls a Chat Completions endpoint at `/chat/completions`.
Both providers ask the model to return a strict JSON action for the runtime to execute.
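For illustration, the returned action might be parsed and validated along these lines. The field names (`tool`, `args`) and the error handling are assumptions about the contract, not a documented schema.

```python
import json

ALLOWED_TOOLS = {"read_file", "search_code", "repo_map", "write_file",
                 "edit_file", "run_tests", "run_command", "final_answer"}

def parse_action(raw_text: str) -> dict:
    """Parse the model's reply into a strict tool action, rejecting anything malformed."""
    try:
        action = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return {"tool": "invalid", "error": f"not JSON: {exc}"}

    if not isinstance(action, dict) or action.get("tool") not in ALLOWED_TOOLS:
        return {"tool": "invalid", "error": "missing or unknown tool name"}
    action.setdefault("args", {})
    return action

# Example of the shape the runtime might expect back from the model:
# {"tool": "read_file", "args": {"path": "README.md", "offset": 0, "limit": 200}}
```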
Keep secrets out of the repo. Configure the provider with environment variables
or a local .env file:
```
export OPENAI_API_KEY="<your-api-key>"
export HARNESSCODER_OPENAI_BASE_URL="https://your-openai-compatible-endpoint.example/v1"
export HARNESSCODER_OPENAI_MODEL="your-codex-model-name"

python -m harnesscoder --provider openai-codex "take a look at what this repo does"
```

If the base URL does not end in `/v1`, HarnessCoder appends `/v1` before calling `/responses` or `/chat/completions`.
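A sketch of that normalization rule, assuming simple string handling rather than the project's actual client code:

```python
def normalize_base_url(base_url: str) -> str:
    """Ensure the OpenAI-compatible base URL ends in /v1 before endpoint paths are added."""
    base = base_url.rstrip("/")
    return base if base.endswith("/v1") else base + "/v1"

# normalize_base_url("https://your-openai-compatible-endpoint.example")
#   -> "https://your-openai-compatible-endpoint.example/v1"
# The runtime then appends /responses or /chat/completions to this base.
```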
DeepSeek can be configured through the Chat Completions provider. Keep the API
key in .env or your shell environment and reference the variable from
models.toml:
```
[models.deepseek]
provider = "openai-chat"
model = "deepseek-v4-pro"
base_url = "https://api.deepseek.com"
api_key_env = "DEEPSEEK_API_KEY"
timeout = 120
max_output_tokens = 2000
```

Run a DeepSeek matrix:
```
python -m harnesscoder \
  --model-config models.toml \
  --model-profiles hc_bench_oracle,scripted,deepseek \
  --context-mode pack \
  --eval eval/hc_bench_20.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/hc-bench-20-deepseek-matrix.md
```

When launched from a repo, the CLI auto-loads `.env` from the current directory and from `--cwd` if it is different. Existing shell environment variables win over `.env` values. `OPENAI_MODEL` is also accepted as a fallback for `HARNESSCODER_OPENAI_MODEL`.
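That precedence can be summarized in a small sketch. The helper names, the simplified `.env` parsing, and the exact ordering between the two variable names are assumptions drawn from the description above.

```python
import os
from pathlib import Path

def load_dotenv_defaults(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file, ignoring comments and blanks."""
    values: dict[str, str] = {}
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip().strip('"')
    return values

def resolve_model_name(dotenv: dict[str, str]) -> str | None:
    """Shell environment wins over .env; OPENAI_MODEL is accepted as a fallback name."""
    for key in ("HARNESSCODER_OPENAI_MODEL", "OPENAI_MODEL"):
        if os.environ.get(key):
            return os.environ[key]
        if dotenv.get(key):
            return dotenv[key]
    return None
```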
Each run writes event records with a timestamp, run id, and event type. The runtime trace includes at least:
`run_started`, `context_packed`, `model_action`, `policy_decision`, `tool_result`, `test_result`, `state_updated`, `checkpoint_created`, `run_resumed`, and `run_finished`.
These traces are intentionally append-only JSONL so later replay and eval code can consume them without depending on in-memory state.
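Because the trace is plain JSONL, replay and eval code can consume it with nothing more than line-by-line parsing. A minimal reader, with assumed field names, looks like this:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_trace(trace_path: Path) -> dict:
    """Fold an append-only JSONL trace into per-event-type counts."""
    counts: Counter[str] = Counter()
    for line in trace_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            # "event" is an assumed field name for the event type.
            counts[json.loads(line).get("event", "unknown")] += 1
    return {"events": dict(counts), "total": sum(counts.values())}
```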
See docs/development-process.md for the running engineering log: design decisions, bugs encountered during real provider integration, fixes, and interview-ready talking points.
For interview-facing material, see docs/showcase.md and docs/architecture.md. For release checks, see docs/release-checklist.md and docs/spec-1.0.0.md. The 1.0.1 evaluation tightening is scoped in docs/spec-1.0.1.md, and the 1.0.2 observation artifact store is scoped in docs/spec-1.0.2.md.
Replay loads a trace and reconstructs a structured summary:

```
python -m harnesscoder.replay .harnesscoder/runs/<run_id>/trace.jsonl
python -m harnesscoder --replay .harnesscoder/runs/<run_id>/trace.jsonl
```

Resume continues an interrupted run from the saved checkpoint:

```
python -m harnesscoder --resume .harnesscoder/runs/<run_id>/checkpoint.json
```

Eval stays workflow-shaped around the dynamic agent loop:
```
setup repo -> run agent -> run tests -> collect trace -> score -> report
```
Run the local smoke eval:
```
python -m harnesscoder --eval eval/cases.json
```

Run one named model profile:

```
python -m harnesscoder \
  --model-profile scripted \
  --eval eval/cases.json
```

Run the real bugfix loop with an OpenAI-compatible model:
```
export OPENAI_API_KEY="<your-api-key>"
export HARNESSCODER_OPENAI_BASE_URL="https://your-openai-compatible-endpoint.example/v1"
export HARNESSCODER_OPENAI_MODEL="your-codex-model-name"

python -m harnesscoder \
  --provider openai-codex \
  --eval eval/bugfix_cases.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/bugfix-demo.md
```

`eval/bugfix_cases.json` uses `examples/bugfix_demo/repo` as a fixture. The eval runner copies it into an isolated `.harnesscoder/eval-workspaces/...` workspace before the agent edits files, so demo fixtures remain stable.
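The fixture isolation step can be pictured as a plain copy into a fresh workspace. The directory layout below follows the paths in this README, while the function name and naming scheme are illustrative.

```python
import shutil
import uuid
from pathlib import Path

def prepare_eval_workspace(fixture_repo: Path, case_id: str) -> Path:
    """Copy a demo fixture into an isolated workspace so the agent never edits the original."""
    workspace = Path(".harnesscoder/eval-workspaces") / f"{case_id}-{uuid.uuid4().hex[:8]}"
    shutil.copytree(fixture_repo, workspace)
    return workspace

# Example:
# prepare_eval_workspace(Path("examples/bugfix_demo/repo"), "bugfix-001")
```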
Run the minimal greenfield loop:
```
python -m harnesscoder \
  --provider openai-codex \
  --eval eval/greenfield_cases.json \
  --max-iterations 10 \
  --eval-report .harnesscoder/reports/greenfield-demo.md
```

`eval/greenfield_cases.json` starts from `examples/greenfield_demo/repo`, which contains no application code. The agent must create `math_utils.py` and `test_math_utils.py`, pass `python -m unittest discover`, and pass a separate verifier command. The case also declares `allowed_tools` and `step_budget`, so the eval contract is explicit instead of hidden in prose.
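A case entry with such an explicit contract might look roughly like the structure below. The exact keys in `eval/greenfield_cases.json` may differ, so treat this as an assumed shape built from the fields named above.

```python
import json

# Assumed shape of one greenfield case; field names follow this README, not a verified schema.
case = {
    "id": "greenfield-math-utils",
    "fixture": "examples/greenfield_demo/repo",
    "prompt": "Create math_utils.py with small numeric helpers and unit tests for them.",
    "allowed_tools": ["read_file", "search_code", "write_file", "run_tests"],
    "step_budget": 10,
    "tests": "python -m unittest discover",
    "verifier": "python -c \"import math_utils\"",
}

print(json.dumps(case, indent=2))
```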
Compare profiles with an eval matrix:
```
cp models.example.toml models.toml
# Edit models.toml locally, then keep it out of git if it contains private endpoints.

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles hc_bench_oracle,scripted,openai_codex,deepseek \
  --eval eval/hc_bench_20.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/hc-bench-20-real-matrix.md
```

The matrix report compares pass rate, test pass rate, verifier pass rate, average tool calls, repeated reads, invalid calls, policy denials, tool failures, memory/compression metrics, RepoMap use/injection metrics, observation artifact metrics, and failure categories. Each profile/case run still keeps its own trace and artifact directory. If a real-model profile cannot initialize, the matrix records the profile error instead of hiding the reason.
Compare context modes:
```
python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode none \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-real-none.md

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode pack \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-real-pack.md

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode memory \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-real-memory.md
```

Compare RepoMap injection:
```
python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode pack \
  --repo-map-mode none \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-without-repo-map.md

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode pack \
  --repo-map-mode auto \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-with-repo-map.md
```

Run HC-Bench-20 with the deterministic local oracle:
```
python -m harnesscoder \
  --provider hc-bench-oracle \
  --eval eval/hc_bench_20.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/hc-bench-20-oracle.md
```

HC-Bench-20 is the 0.7.0 interview benchmark. It contains 20 local cases:
- 7 bugfix cases for business-like defects.
- 3 recovery cases that require a failing test and a second fix.
- 5 greenfield cases that create modules and tests through `write_file`.
- 2 context cases that reward search-first, bounded reads in large files.
- 3 policy cases for path traversal, command injection, and dangerous command denial.
The oracle is not a claim about model intelligence. It is a stable baseline for
the harness itself: fixture isolation, policy gates, trace metrics, verifiers,
and category-level reports. Real providers can be compared against the same
suite through --model-profiles.
Near-term TODOs:
- Improve the TUI with better history navigation and richer trace inspection commands.
- Add richer failure replay fixtures under `replay/`.
- Add token/cost accounting when providers return usage data.