HarnessCoder

HarnessCoder is a local coding agent harness for real repository tasks. The 1.0 story is deliberately narrow:

  • event-sourced agent loop
  • policy-gated tools
  • trace/replay/eval
  • context governance: memory, compression, and RepoMap

It is not a fork of CoreCoder, not a smaller LangGraph clone, and not a web UI. The first goal is a controllable runtime that can run an agent loop, gate tool execution with policy, and write every important decision into a replayable JSONL trace.

The core loop is dynamic:

state -> model decides action -> policy checks -> tool executes
      -> observation appended -> state updated -> model decides again

That shape matters because coding tasks are rarely a fixed DAG. The useful next step depends on the current repo, tool observations, failures, test output, and the model's evolving plan. A DAG or LangGraph-style workflow can be useful for the eval pipeline around the agent, but the agent itself should remain a policy-gated loop.
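
Read as ordinary Python, the loop looks roughly like the sketch below. Every name in it (model.decide, policy.check, tools.execute, state.observe) is hypothetical rather than HarnessCoder's actual API; only the control flow mirrors the diagram above.

# Minimal sketch of the policy-gated loop. All identifiers are
# illustrative; only the control flow follows the diagram above.
def run_loop(state, model, policy, tools, trace, max_iterations=8):
    for _ in range(max_iterations):
        action = model.decide(state)                  # model decides action
        trace.append({"event": "model_action", "action": action})
        if action["tool"] == "final_answer":
            return action["args"]["answer"]
        decision = policy.check(action)               # policy checks
        trace.append({"event": "policy_decision", "allowed": decision.allowed})
        if not decision.allowed:
            state.observe({"policy_denied": decision.reason})
            continue
        observation = tools.execute(action)           # tool executes
        trace.append({"event": "tool_result", "observation": observation})
        state.observe(observation)                    # observation appended, state updated
    return None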

Current Status

Version 1.0.2 is a runnable local runtime with real bugfix and minimal greenfield eval loops, HC-Bench-20, trace replay, eval reporting, model-profile comparison, context-governed prompt assembly, task-local memory, compression metrics, lightweight RepoMap, checkpoint/resume support, and a large-output artifact store for audit/replay. It includes:

  • A ScriptedModel that simulates model actions without calling a real LLM.
  • Tool execution for:
    • read_file(path, offset=0, limit=200)
    • search_code(query, path=".")
    • repo_map(query=None, max_tokens=1200, refresh=false)
    • write_file(path, content, overwrite=false)
    • edit_file(path, old, new)
    • run_tests(cmd=None, timeout=60)
    • run_command(cmd, timeout=30)
  • A minimal policy gate before every tool call.
  • JSONL traces under .harnesscoder/runs/<run_id>/trace.jsonl.
  • Large tool observations are previewed in trace/model context and persisted under the run's artifacts/ directory with size and hash metadata.
  • context_packed, checkpoint_created, run_resumed, and test_result events for reliability-oriented replay.
  • repo_map_built and repo_map_used events for repository-level context governance.
  • Trace replay summaries through python -m harnesscoder.replay.
  • A minimal eval harness that runs cases, executes tests, scores results, and renders a Markdown report.
  • Fixture-backed bugfix evals that copy a repo into .harnesscoder/eval-workspaces/... before editing it.
  • A greenfield eval that starts from a nearly empty fixture and creates source plus tests from scratch.
  • Case-level allowed_tools, step_budget, and verifier fields inspired by benchmark harnesses such as Pico.
  • Model profiles and Markdown eval matrices for comparing the same cases across providers.
  • HC-Bench-20: 20 fixture-backed cases across bugfix, recovery, greenfield, context-governance, and policy/safety categories.
  • A deterministic hc-bench-oracle provider that proves the benchmark is solvable and the report pipeline works before comparing real models.
  • CLI entrypoints:
python -m harnesscoder "take a look at what this repo does"
python -m harnesscoder --replay .harnesscoder/runs/<run_id>/trace.jsonl
python -m harnesscoder --resume .harnesscoder/runs/<run_id>/checkpoint.json
python -m harnesscoder --eval eval/cases.json
python -m harnesscoder --provider hc-bench-oracle --eval eval/hc_bench_20.json

The scripted model currently performs a small repo-orientation pass: search for project mentions, read README.md, list files, and then produce a final answer.
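
Written down as data, such an orientation pass might look like the hypothetical script below. It reuses the tool signatures listed above, but it is an illustration rather than the ScriptedModel source, and the final_answer action name is a guess.

# Hypothetical scripted action sequence for the repo-orientation pass.
ORIENTATION_SCRIPT = [
    {"tool": "search_code", "args": {"query": "HarnessCoder", "path": "."}},
    {"tool": "read_file", "args": {"path": "README.md", "offset": 0, "limit": 200}},
    {"tool": "run_command", "args": {"cmd": "ls", "timeout": 30}},
    {"tool": "final_answer", "args": {"answer": "summary of what the repo does"}},
]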

TUI

HarnessCoder also has a lightweight standard-library terminal UI:

python -m harnesscoder --tui

Inside the TUI, send a normal message to run the agent and write a new trace. The UI keeps refreshing while a run is active, shows the latest trace event in the status area, and folds the header on narrow or short terminals. Use slash commands for direct tool calls and runtime controls:

/help
/status
/model your-model-name
/model scripted
/provider openai-codex
/base-url https://your-openai-compatible-endpoint.example
/read README.md
/search HarnessCoder
/repo-map HarnessCoder
/edit README.md old new
/test python -m unittest discover -s tests
/run git status --short
/trace latest

The current TUI is intentionally small: it is a runnable control surface for the runtime and eval harness, not a full Claude Code clone.

Context Governance

HarnessCoder's context governance has three task-local layers:

  • Packed context summarizes hot observations, cold trace history, modified files, and budget.
  • Working memory stores task-scoped facts such as failing tests, explored files, relevant symbols, patch summaries, verified facts, and open questions.
  • RepoMap builds a bounded repository index from Python AST symbols (imports, classes, functions), with fallback regex symbols for non-Python text files (see the sketch below).
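
For a sense of what that RepoMap layer extracts, here is a minimal sketch of Python AST symbol collection. It is illustrative only: the real index also enforces the token budget and falls back to regex symbols for non-Python text.

# Minimal sketch of AST-based symbol extraction for one Python file.
import ast

def python_symbols(path):
    tree = ast.parse(open(path, encoding="utf-8").read())
    symbols = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append((type(node).__name__, node.name, node.lineno))  # classes and functions
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            symbols.append(("import", ast.unparse(node), node.lineno))     # import statements
    return symbols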

Use prompt modes to ablate these layers:

python -m harnesscoder --context-mode none "inspect this repo"
python -m harnesscoder --context-mode pack "inspect this repo"
python -m harnesscoder --context-mode memory "inspect this repo"

RepoMap injection is enabled by default for pack and memory modes and can be disabled independently:

python -m harnesscoder \
  --context-mode pack \
  --repo-map-mode none \
  "inspect this repo"

OpenAI-Compatible Providers

The MVP includes two optional OpenAI-compatible real-model providers:

  • openai-codex calls a Responses API endpoint at /responses.
  • openai-chat calls a Chat Completions endpoint at /chat/completions.

Both providers ask the model to return a strict JSON action for the runtime to execute.
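
The exact schema is defined by the runtime; as a hypothetical illustration, a returned action has roughly this shape:

{"tool": "read_file", "args": {"path": "README.md", "offset": 0, "limit": 200}}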

Keep secrets out of the repo. Configure the provider with environment variables or a local .env file:

export OPENAI_API_KEY="<your-api-key>"
export HARNESSCODER_OPENAI_BASE_URL="https://your-openai-compatible-endpoint.example/v1"
export HARNESSCODER_OPENAI_MODEL="your-codex-model-name"

python -m harnesscoder --provider openai-codex "take a look at what this repo does"

If the base URL does not end in /v1, HarnessCoder appends /v1 before calling /responses or /chat/completions.

DeepSeek can be configured through the Chat Completions provider. Keep the API key in .env or your shell environment and reference the variable from models.toml:

[models.deepseek]
provider = "openai-chat"
model = "deepseek-v4-pro"
base_url = "https://api.deepseek.com"
api_key_env = "DEEPSEEK_API_KEY"
timeout = 120
max_output_tokens = 2000

Run a DeepSeek matrix:

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles hc_bench_oracle,scripted,deepseek \
  --context-mode pack \
  --eval eval/hc_bench_20.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/hc-bench-20-deepseek-matrix.md

When launched from a repo, the CLI auto-loads .env from the current directory and from --cwd if it is different. Existing shell environment variables win over .env values. OPENAI_MODEL is also accepted as a fallback for HARNESSCODER_OPENAI_MODEL.
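
That precedence amounts to a loader that never overwrites variables the shell already set. A minimal sketch of such a loader (not HarnessCoder's actual implementation):

# Apply .env values only where the environment has no value yet.
import os

def load_env_file(path=".env"):
    try:
        text = open(path, encoding="utf-8").read()
    except FileNotFoundError:
        return
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        os.environ.setdefault(key.strip(), value.strip().strip('"'))  # shell wins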

Trace Shape

Each run writes event records with a timestamp, run id, and event type. The runtime trace includes at least:

  • run_started
  • context_packed
  • model_action
  • policy_decision
  • tool_result
  • test_result
  • state_updated
  • checkpoint_created
  • run_resumed
  • run_finished

These traces are intentionally append-only JSONL so later replay and eval code can consume them without depending on in-memory state.
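
As a hypothetical illustration (the documented fields are the timestamp, run id, and event type; the other field names are guesses), two trace lines might look like:

{"ts": "2025-01-01T00:00:00Z", "run_id": "run-abc123", "event": "run_started", "task": "inspect this repo"}
{"ts": "2025-01-01T00:00:02Z", "run_id": "run-abc123", "event": "policy_decision", "tool": "read_file", "allowed": true}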

Developer Process Notes

See docs/development-process.md for the running engineering log: design decisions, bugs encountered during real provider integration, fixes, and interview-ready talking points.

For interview-facing material, see docs/showcase.md and docs/architecture.md. For release checks, see docs/release-checklist.md and docs/spec-1.0.0.md. The 1.0.1 evaluation tightening is scoped in docs/spec-1.0.1.md, and the 1.0.2 observation artifact store is scoped in docs/spec-1.0.2.md.

Replay And Eval

Replay loads a trace and reconstructs a structured summary:

python -m harnesscoder.replay .harnesscoder/runs/<run_id>/trace.jsonl
python -m harnesscoder --replay .harnesscoder/runs/<run_id>/trace.jsonl

Resume continues an interrupted run from the saved checkpoint:

python -m harnesscoder --resume .harnesscoder/runs/<run_id>/checkpoint.json

Eval stays workflow-shaped around the dynamic agent loop:

setup repo -> run agent -> run tests -> collect trace -> score -> report

Run the local smoke eval:

python -m harnesscoder --eval eval/cases.json

Run one named model profile:

python -m harnesscoder \
  --model-profile scripted \
  --eval eval/cases.json

Run the real bugfix loop with an OpenAI-compatible model:

export OPENAI_API_KEY="<your-api-key>"
export HARNESSCODER_OPENAI_BASE_URL="https://your-openai-compatible-endpoint.example/v1"
export HARNESSCODER_OPENAI_MODEL="your-codex-model-name"

python -m harnesscoder \
  --provider openai-codex \
  --eval eval/bugfix_cases.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/bugfix-demo.md

eval/bugfix_cases.json uses examples/bugfix_demo/repo as a fixture. The eval runner copies it into an isolated .harnesscoder/eval-workspaces/... workspace before the agent edits files, so demo fixtures remain stable.

Run the minimal greenfield loop:

python -m harnesscoder \
  --provider openai-codex \
  --eval eval/greenfield_cases.json \
  --max-iterations 10 \
  --eval-report .harnesscoder/reports/greenfield-demo.md

eval/greenfield_cases.json starts from examples/greenfield_demo/repo, which contains no application code. The agent must create math_utils.py and test_math_utils.py, pass python -m unittest discover, and pass a separate verifier command. The case also declares allowed_tools and step_budget, so the eval contract is explicit instead of hidden in prose.
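
For illustration, a case of this kind might spell out its contract like the hypothetical entry below. Only allowed_tools, step_budget, and verifier are documented field names; the rest is a guess at the shape.

{
  "id": "greenfield-math-utils",
  "fixture": "examples/greenfield_demo/repo",
  "allowed_tools": ["read_file", "search_code", "write_file", "run_tests"],
  "step_budget": 10,
  "test_cmd": "python -m unittest discover",
  "verifier": "python verify_case.py"
}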

Compare profiles with an eval matrix:

cp models.example.toml models.toml
# Edit models.toml locally, then keep it out of git if it contains private endpoints.

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles hc_bench_oracle,scripted,openai_codex,deepseek \
  --eval eval/hc_bench_20.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/hc-bench-20-real-matrix.md

The matrix report compares pass rate, test pass rate, verifier pass rate, average tool calls, repeated reads, invalid calls, policy denials, tool failures, memory/compression metrics, RepoMap use/injection metrics, observation artifact metrics, and failure categories. Each profile/case run still keeps its own trace and artifact directory. If a real-model profile cannot initialize, the matrix records the profile error instead of hiding the reason.

Compare context modes:

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode none \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-real-none.md

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode pack \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-real-pack.md

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode memory \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-real-memory.md

Compare RepoMap injection:

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode pack \
  --repo-map-mode none \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-without-repo-map.md

python -m harnesscoder \
  --model-config models.toml \
  --model-profiles deepseek \
  --context-mode pack \
  --repo-map-mode auto \
  --eval eval/hc_bench_20.json \
  --eval-report .harnesscoder/reports/hc-bench-20-with-repo-map.md

Run HC-Bench-20 with the deterministic local oracle:

python -m harnesscoder \
  --provider hc-bench-oracle \
  --eval eval/hc_bench_20.json \
  --max-iterations 8 \
  --eval-report .harnesscoder/reports/hc-bench-20-oracle.md

HC-Bench-20 is the interview benchmark introduced in 0.7.0. It contains 20 local cases:

  • 7 bugfix cases for business-like defects.
  • 3 recovery cases that require a failing test and a second fix.
  • 5 greenfield cases that create modules and tests through write_file.
  • 2 context cases that reward search-first, bounded reads in large files.
  • 3 policy cases for path traversal, command injection, and dangerous command denial.

The oracle is not a claim about model intelligence. It is a stable baseline for the harness itself: fixture isolation, policy gates, trace metrics, verifiers, and category-level reports. Real providers can be compared against the same suite through --model-profiles.
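
For a concrete picture of those policy/safety categories, the checks amount to something like the toy gate below (illustrative only; not the actual gate's rules or API):

# Toy policy gate: keep file paths inside the workspace, deny risky commands.
import os

DENY_SUBSTRINGS = ("rm -rf", "sudo ", "| sh")

def gate(action, workspace):
    root = os.path.realpath(workspace)
    if "path" in action.get("args", {}):
        target = os.path.realpath(os.path.join(root, action["args"]["path"]))
        if not target.startswith(root + os.sep):
            return False, "path traversal outside the workspace"
    if action["tool"] == "run_command":
        if any(s in action["args"]["cmd"] for s in DENY_SUBSTRINGS):
            return False, "dangerous command denied"
    return True, "allowed"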

Near-term TODOs:

  • Improve the TUI with better history navigation and richer trace inspection commands.
  • Add richer failure replay fixtures under replay/.
  • Add token/cost accounting when providers return usage data.

About

Trace‑backed Coding Agent Runtime & Evaluation Harness
