AgentV companion eval project for a public coding/web financial research agent.
This repository is not a fork of Dexter and does not own Dexter's agent code or dataset. It uses Dexter's public src/evals/ dataset as a pinned benchmark fixture and golden-answer source so the AgentV Dashboard can show a realistic public domain-agent project.
See BASELINE_RESULTS.md for a public narrative report on the financial-research baseline runs, links to raw AgentV artifacts, and the cross-domain AgentV proof point alongside the legal/document-intelligence eval pack. The published dashboard-style static report is served at https://entityprocess.github.io/financial-research-evals/.
The first public demo is pinned to Dexter commit:
8d9419829f443f84b804d033bb2c3b1fbd788629
Dexter's own eval flow at that commit uses:
- bun run src/evals/run.ts
- optional sampling with --sample N
- src/evals/dataset/finance_agent.csv
- CSV columns: Question,Answer,Question Type,Expert time (mins),Rubric
- an LLM-as-judge correctness check, with CSV rubric metadata containing correctness and contradiction criteria
The committed AgentV eval keeps the question/answer fixture shape for every row in the pinned CSV: Dexter questions become AgentV input, and Dexter answers become expected_output. Dexter's runtime evaluator ignores the CSV Rubric column, but this project intentionally preserves those entries as native AgentV llm-grader rubrics. The shared prompt in prompts/dexter-grader.md receives AgentV's {{ rubrics_json }} and {{ metadata_json }} structured variables, so the eval does not duplicate question/answer data into grader-only payloads.
By default, the eval does not run Dexter. It runs a coding/web research agent against Dexter's public golden answers, so the demo does not require FINANCIAL_DATASETS_API_KEY. The real dexter-agent target remains available as an optional compatibility target for users who have Dexter's paid data prerequisites configured.
This repository mirrors the legal-style harness split:
- AgentV is only the eval harness: target selection, execution, results, and grading.
- Financial research skills/workflow are target-agent behavior maintained in skills/, workflows/, and prompts/financial-research-system.md.
- Dexter CSV and golden answers are benchmark fixture/provenance from the pinned public commit. They are not answer sources during benchmark execution.
- dexter-agent remains an optional compatibility/reference target for users with Dexter's private prerequisites.
The default public financial-research-agent and codex targets embed a compact version of the reusable financial-research workflow in .agentv/targets.yaml and point back to the canonical prompt/skill files. This keeps the default target behavior explicit without adding provider-specific Dexter internals to AgentV core.
Dexter was inspected at commit 8d9419829f443f84b804d033bb2c3b1fbd788629 for portable workflow ideas, including financial research, DCF, memo, sentiment, tool-use, subagent, and finance-router guidance. That pinned checkout does not include a standalone LICENSE, NOTICE, or COPYING file, and package.json does not declare a license field, but Dexter's README license section states that the project is MIT licensed. Run bun run check:dexter-provenance with DEXTER_REPO_PATH set to refresh this evidence.
This repository still does not copy Dexter SKILL.md prose, source code, provider/API internals, or private data assumptions. The skill cards here are original, generic public-financial-research guidance; Dexter remains the dataset provenance and optional reference target.
Install AgentV separately.
For the default financial-research-agent target, configure a Codex-style coding agent plus a grader:
AGENT_TARGET=financial-research-agent
CODEX_EXECUTABLE=codex-eng
CODEX_MODEL=gpt-5.5
CODEX_REASONING_EFFORT=low
CODEX_WORKSPACE_DIR=.agentv/codex-workspaces
CODEX_LOG_DIR=.agentv/logs/codex
GRADER_TARGET=openai-grader
OPENAI_API_KEY=...
OPENAI_MODEL=gpt-5.5
Clone and pin Dexter only when regenerating eval YAML from Dexter's CSV or when running the optional real dexter-agent target:
git clone https://github.com/virattt/dexter.git ../dexter
git -C ../dexter checkout 8d9419829f443f84b804d033bb2c3b1fbd788629
cd ../dexter
bun install
Create local env for this project:
cp .env.example .env
Fill in only local values in .env. Do not commit .env, resolved provider endpoints, API keys, Bitwarden output, or result-repo tokens.
Required variables for the default public-demo target:
- AGENT_TARGET=financial-research-agent
- CODEX_EXECUTABLE
- CODEX_MODEL
- CODEX_WORKSPACE_DIR
- CODEX_LOG_DIR
- GRADER_TARGET
- grader model variables for the selected grader target
  - for GRADER_TARGET=azure: AZURE_OPENAI_RESPONSES_BASE_URL, AZURE_OPENAI_API_KEY, and AZURE_DEPLOYMENT_NAME
Additional variables for optional AGENT_TARGET=dexter-agent:
- DEXTER_REPO_PATH
- OPENAI_API_KEY
- FINANCIAL_DATASETS_API_KEY
- EXASEARCH_API_KEY or TAVILY_API_KEY
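A check for these variables could look roughly like the sketch below. It is not the project's actual setup script: the constant names, the `missingVariables` helper, and the "at least one of" handling for the search keys are illustrative assumptions. It reports variable names only and never echoes a value.

```typescript
// Variable names are taken from the lists above; everything else is hypothetical.
const DEFAULT_TARGET_VARS = [
  "AGENT_TARGET",
  "CODEX_EXECUTABLE",
  "CODEX_MODEL",
  "CODEX_WORKSPACE_DIR",
  "CODEX_LOG_DIR",
  "GRADER_TARGET",
];
const DEXTER_EXTRA_VARS = ["DEXTER_REPO_PATH", "OPENAI_API_KEY", "FINANCIAL_DATASETS_API_KEY"];
// Assumption: one search-provider key is enough for dexter-agent.
const DEXTER_SEARCH_VARS = ["EXASEARCH_API_KEY", "TAVILY_API_KEY"];

type Env = Record<string, string | undefined>;

// A variable counts as set only when it holds a non-blank value.
const isSet = (env: Env, name: string): boolean => (env[name] ?? "").trim() !== "";

// Returns the names (never the values) of variables that still need to be configured.
function missingVariables(env: Env, allOf: string[], anyOf: string[] = []): string[] {
  const missing = allOf.filter((name) => !isSet(env, name));
  if (anyOf.length > 0 && !anyOf.some((name) => isSet(env, name))) {
    missing.push(anyOf.join(" or "));
  }
  return missing;
}
```

For the default target, `missingVariables(process.env, DEFAULT_TARGET_VARS)` would list what is still unset. For the optional Dexter target, pass `DEXTER_EXTRA_VARS` as the required list and `DEXTER_SEARCH_VARS` as the alternatives.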
Preflight:
bun run setup
The public target prompt is defined in .agentv/targets.yaml and summarized in prompts/financial-research-system.md; update the skill/workflow files first when changing target-agent research behavior.
Run the full AgentV eval:
agentv eval evals/financial-research-agent.eval.yaml --targets .agentv/targets.yaml --target financial-research-agent
During AgentV repository development, prefer the source CLI from the AgentV checkout:
bun /path/to/agentv/apps/cli/src/cli.ts eval financial-research-agent/evals/financial-research-agent.eval.yaml --targets financial-research-agent/.agentv/targets.yaml --target financial-research-agent
For quick verification, run one committed test by ID:
agentv eval evals/financial-research-agent.eval.yaml --targets .agentv/targets.yaml --target financial-research-agent --test-id us-steel-nippon-merger
To run the real Dexter agent instead, use --target dexter-agent after setting the optional Dexter variables above.
After updating DEXTER_REPO_PATH and DEXTER_COMMIT, regenerate the full AgentV eval from Dexter's public CSV:
bun run scripts/generate-eval-from-dexter.ts --out evals/financial-research-agent.eval.yaml
Use --sample N --out <path> only for local experiments or quick generator checks; do not use a sampled file as the committed dataset boundary.
Review the generated eval before committing. The generator intentionally keeps the conversion conservative and AgentV-native: it preserves Dexter rubric entries as { operator, criteria }-style llm-grader rubric items, uses suite-level source metadata for the pinned CSV, and reuses prompts/dexter-grader.md by file reference.
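When reviewing generated output, it can help to keep the intended row-to-test mapping in mind. The sketch below is not the code in scripts/generate-eval-from-dexter.ts. The `DexterRow` and `EvalTest` types, the `toEvalTest` helper, the `id` field, and the 1-based `sourceRow` are assumptions for illustration, and it assumes the Rubric cell has already been parsed into { operator, criteria } objects, since the CSV's cell encoding is not described here.

```typescript
// Hypothetical shapes; only input, expected_output, rubrics, and the metadata
// field names come from this README.
interface RubricItem {
  operator: string;
  criteria: string;
}

interface DexterRow {
  question: string;
  answer: string;
  questionType: string;
  expertTimeMins: string; // raw CSV text
  rubric: RubricItem[]; // assumed to be parsed already
}

interface EvalTest {
  id: string;
  input: string;
  expected_output: string;
  rubrics: RubricItem[];
  metadata: {
    source_row: number;
    question_type: string;
    expert_time_mins: number | null;
  };
}

// Maps one CSV row to one test, keeping only row-specific metadata per test.
function toEvalTest(row: DexterRow, sourceRow: number, id: string): EvalTest {
  const minutes = Number.parseFloat(row.expertTimeMins);
  return {
    id,
    input: row.question,
    expected_output: row.answer,
    // Copy only operator and criteria so no extra fields leak into the eval.
    rubrics: row.rubric.map(({ operator, criteria }) => ({ operator, criteria })),
    metadata: {
      source_row: sourceRow,
      question_type: row.questionType,
      expert_time_mins: Number.isNaN(minutes) ? null : minutes,
    },
  };
}
```

The pinned-source fields (repository, commit, CSV path) are deliberately absent from the per-test object in this sketch, matching the suite-level placement described above.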
Setup and target scripts print variable names and missing prerequisite guidance only. They must not print resolved secret values, private endpoints, or Bitwarden-derived output.
Public result synchronization belongs to the downstream financial-research-evals work. Before publishing any run artifact, scan it for API keys, provider endpoints, private paths, and sensitive data.
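A pre-publish scan might start from something like the sketch below. The patterns, the `Finding` type, and the `scanForSecrets` name are illustrative assumptions, not an existing project script, and a handful of regular expressions will not catch every secret format. Findings carry only a line number and a kind, so the scan output itself does not repeat the sensitive value.

```typescript
interface Finding {
  line: number; // 1-based line number in the scanned text
  kind: string; // which pattern matched
}

// Hypothetical starter patterns; extend them for the providers actually in use.
const PATTERNS: Array<{ kind: string; regex: RegExp }> = [
  { kind: "openai-style key", regex: /\bsk-[A-Za-z0-9_-]{20,}\b/ },
  { kind: "key assignment", regex: /\b[A-Z0-9_]*(API_KEY|TOKEN|SECRET)\s*[=:]\s*\S+/ },
  { kind: "azure endpoint", regex: /https:\/\/[a-z0-9-]+\.openai\.azure\.com/i },
  { kind: "home path", regex: /(\/home\/|\/Users\/)[^\s/]+/ },
];

// Checks every line against every pattern and records what matched where.
function scanForSecrets(text: string): Finding[] {
  const findings: Finding[] = [];
  text.split(/\r?\n/).forEach((content, index) => {
    for (const { kind, regex } of PATTERNS) {
      // The patterns have no global flag, so test() keeps no state between calls.
      if (regex.test(content)) findings.push({ line: index + 1, kind });
    }
  });
  return findings;
}
```

An empty result is a necessary condition for publishing, not a sufficient one; a manual review of the artifact is still worthwhile.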
The Dexter adaptation uses AgentV's native llm-grader primitive. Each assertion references prompts/dexter-grader.md and passes Dexter CSV rubric entries through rubrics, preserving operator plus criteria so the prompt can distinguish correctness checks from contradiction guards. Suite-level metadata carries the pinned Dexter source fields, while per-test metadata only carries row-specific fields such as source_row, question_type, and expert_time_mins.
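To illustrate why preserving the operator matters, the sketch below groups rubric items by operator, which is the distinction the grader prompt relies on. In this project the grader prompt makes that distinction itself from {{ rubrics_json }}; the `groupByOperator` helper is a hypothetical aid for inspecting an eval file, not part of AgentV.

```typescript
interface RubricItem {
  operator: string; // for example "correctness" or "contradiction"
  criteria: string;
}

// Collects criteria under their operator, preserving the original order within each group.
function groupByOperator(rubrics: RubricItem[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const { operator, criteria } of rubrics) {
    const bucket = groups.get(operator) ?? [];
    bucket.push(criteria);
    groups.set(operator, bucket);
  }
  return groups;
}
```

A test whose rubrics contain only contradiction items, or none at all, is easy to spot this way before a run.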