A pytest-style evaluation framework for LLM outputs. Define test suites in YAML, run them against any model, grade outputs with deterministic criteria or an LLM judge, and compare results across models.
npm install
npm run build
export ANTHROPIC_API_KEY=sk-ant-...
./dist/cli.js run examples/summarization.yamlOr install globally after building:
npm link
eval run examples/summarization.yamlRun a test suite and display results in a table.
eval run examples/summarization.yaml
eval run examples/summarization.yaml --model claude-sonnet-4-6
eval run examples/summarization.yaml --watch # re-run on file save
eval run examples/summarization.yaml --no-cache # skip semantic cache
eval run examples/summarization.yaml --verbose # show full outputs + judge reasoning
eval run examples/summarization.yaml --json out.json # also write raw JSON to a path
eval run examples/summarization.yaml --dataset examples/datasets/prompts.jsonl # override datasetRun the same suite against multiple models and print a side-by-side comparison.
eval compare examples/summarization.yaml \
--models claude-haiku-4-5,claude-sonnet-4-6,claude-opus-4-8
eval compare examples/summarization.yaml \
--models gpt-4o-mini,gpt-4o \
--provider openai
# Mix providers with provider/model syntax
eval compare examples/summarization.yaml \
--models anthropic/claude-haiku-4-5,gemini/gemini-2.0-flash,ollama/llama3Compare two saved result files and report regressions and improvements. Exits with code 1 if any regressions are found — useful in CI.
eval diff results/2024-01-01_suite.json results/2024-01-02_suite.json
eval diff baseline.json candidate.json --format jsonList stored results from ./results/.
eval report
eval report --last 5
eval report --suite summarizationSpin up a local web dashboard to visualize and compare eval results.
eval dashboard
eval dashboard --port 8080
eval dashboard --results-dir ./my-resultsOpens a browser at http://localhost:3000. The dashboard shows run history, per-case breakdowns, and a side-by-side comparison view with a regression tab.
Show configured providers and their API key status.
eval providersname: "My Eval Suite"
description: "Optional description"
provider: anthropic # anthropic | openai | gemini | ollama
model: claude-haiku-4-5 # override per-suite
system_prompt: "You are..." # optional system prompt
temperature: 0.0 # optional, 0-2
max_tokens: 1024 # default 1024
# Optional: stream cases from a .jsonl dataset file
dataset: examples/datasets/prompts.jsonl
dataset_limit: 100 # cap total rows processed
dataset_sample: 20 # random sample of N rows
cases:
- id: "unique-case-id" # optional, auto-generated if omitted
prompt: "Your prompt here"
criteria:
- type: exact_match
value: "Expected output"
case_sensitive: false
- type: contains
value: "keyword"
case_sensitive: false
- type: max_words
value: 50
- type: regex
value: "^\\d{4}-\\d{2}-\\d{2}$"
flags: ""
- type: llm_judge
rubric: "The response should be polite and answer the question directly."
pass_threshold: 3 # 1-5, default 3
model: claude-opus-4-8 # optional judge model override
- type: code_execution
language: python # python | javascript | bash
test_code: "assert solution(2, 3) == 5"
expected_output: "" # optional: assert stdout matches
timeout_ms: 10000 # default 10000Use turns instead of prompt to evaluate multi-step conversations. Set content: null on assistant turns the model should fill in — the last null turn is the one evaluated by graders.
cases:
- id: remember-name
turns:
- role: user
content: "My name is Ivan. Please remember it."
- role: assistant
content: null # model responds here
- role: user
content: "What is my name?"
- role: assistant
content: null # this turn is evaluated
criteria:
- type: contains
value: "Ivan"Point the suite at a .jsonl file. Each line is a JSON object whose keys become {{variable}} substitutions in prompt and criteria fields.
dataset: examples/datasets/coding-problems.jsonl
dataset_limit: 50
cases:
- id: "coding-{{id}}"
prompt: |
Solve the following Python problem. Write only the function.
{{problem}}
criteria:
- type: code_execution
language: python
test_code: "{{test_code}}"| Type | Description | Pass condition |
|---|---|---|
exact_match |
String equality (trimmed) | Output equals value |
contains |
Substring check | Output contains value |
max_words |
Word count limit | Word count ≤ value |
regex |
Regular expression test | Pattern matches output |
llm_judge |
Second LLM scores 1-5 | Score ≥ pass_threshold (default 3) |
code_execution |
Runs extracted code + optional test assertions | Code exits 0 and assertions pass |
A case passes only when all criteria pass.
| Provider | Key env var | Notes |
|---|---|---|
anthropic |
ANTHROPIC_API_KEY |
Default provider |
openai |
OPENAI_API_KEY |
GPT-4o, o1, etc. |
gemini |
GEMINI_API_KEY |
Free tier available — get key at aistudio.google.com |
ollama |
— | Local models, no key required. Set OLLAMA_HOST to override http://localhost:11434 |
Copy .evalrc.json.example to .evalrc.json and adjust:
{
"default_provider": "anthropic",
"default_model": "claude-haiku-4-5",
"judge_model": "claude-opus-4-8",
"results_dir": "./results",
"cache_enabled": true
}API keys can be set in the config file or via environment variables — env vars take precedence:
| Env var | Config key |
|---|---|
ANTHROPIC_API_KEY |
anthropic_api_key |
OPENAI_API_KEY |
openai_api_key |
GEMINI_API_KEY |
gemini_api_key |
OLLAMA_HOST |
(env only) |
All (model, prompt, system_prompt, temperature, max_tokens) tuples are cached in .eval-cache/ as SHA-256-keyed JSON files. Subsequent runs with identical inputs return the cached response instantly at zero cost. Use --no-cache to force a fresh call.
Drop a .js file into a graders/ folder next to your eval YAML to add a custom grader type — no code changes needed.
// graders/sentiment.js
export default {
type: "sentiment",
run({ output, criteria }) {
const positive = ["good", "great", "excellent"].some(w => output.includes(w));
return { passed: criteria.expected === "positive" ? positive : !positive };
},
};criteria:
- type: sentiment
expected: positiveSee examples/plugins/sentiment_grader.js for a full example.
- Terminal: color-coded table with pass/fail, latency, cost, and per-criteria detail.
- JSON: every run is auto-saved to
./results/<timestamp>_<suite>.jsonfor later inspection or CI diffing. - Dashboard:
eval dashboardopens a browser UI with run history, charts, and a regression diff view.
YAML is human-readable and easy to version-control. It supports multiline strings naturally (pipe | syntax), which is essential for long prompts. Zod validation at load time gives clear error messages when the schema is wrong.
The LLM-as-judge grader uses a second API call with a strict rubric-scoring prompt, deliberately kept separate from the model under test. This avoids self-grading bias and lets you use the strongest available judge (default: claude-opus-4-8) even when evaluating cheaper models.
Exact-match caching is deterministic and free. Embedding-based "semantic" caching would add cost and complexity while introducing risk of cache collisions for subtly different prompts. The cache key covers (model, prompt, system_prompt, temperature, max_tokens) — everything that affects the output.
Binary pass/fail per case is easier to reason about than aggregated scores. The pass_rate gives a single headline number that maps cleanly to CI exit codes (exit 1 on any failure).
The judge receives the output and a rubric, responds with {"score": 1-5, "reasoning": "..."} in JSON, and the case passes when score >= pass_threshold. Default threshold is 3 (middle of the scale). The reasoning is surfaced in --verbose mode and in the saved JSON.
Commander has a simpler, more chainable API for typed TypeScript projects and requires no external type packages. Both would work equally well here.