feat(GH-26): Agent Benchmark System — Lot B : Dashboard frontend by simodev25 · Pull Request #27 · simodev25/KairosMesh

simodev25 · 2026-05-11T15:22:38Z

Summary

React frontend dashboard for the trading-agent benchmark system (Lot B). It relies on the REST API delivered by Lot A (GH-24, PR #25).

New /benchmark page integrated into the existing SPA, with sidebar navigation
Fixture management (list, detail) and run launch with LLM model selection
Visualization of V1 scores per metric, plus a multi-model comparison view

Changes

frontend/src/types/benchmark.ts: TypeScript types (fixtures, runs, cases, attempts, V1 scores)
frontend/src/api/client.ts: 6 benchmark API methods added
frontend/src/pages/BenchmarkPage.tsx: full page with 5 views (fixtures, launch, results, comparison, run detail)
frontend/src/components/Layout.tsx: BENCHMARK entry in the sidebar navigation
frontend/src/App.tsx: lazy-loaded /benchmark route
.samourai/docai/changes/: spec, plan, test plan, pm-notes

Linked issue

Closes #26

Tests run

cd frontend && npx tsc --noEmit 2>&1 | grep benchmark
# Result: 0 errors in the GH-26 files

Note: the pre-existing TS errors in BacktestsPage, ConnectorsPage, OrdersPage and RunDetailPage are out of scope.

Risks

PR is based on feat/GH-24 (not main): merge PR #25 (Lot A backend engine: fixtures, engine, scoring, API; issue #24) first
The global frontend build is broken by pre-existing TS errors; this change introduces no regression
The benchmark backend API must be available at runtime

Checklist

No secrets added
Zero TS errors in the GH-26 files (issue #26)
Design system respected (tokens, JetBrains Mono, terminal-style)
Breaking changes: none

Add a UI panel and page-state hooks to create benchmark fixtures, wire the frontend API client (createBenchmarkFixture) and include the change plan. This commit implements the form, state management, types and API call used to POST /benchmark/fixtures from the frontend.

Add realistic, per-agent input presets used by the benchmark fixture creator so tests and manual runs better reflect production agent inputs; updates benchmark page and creation panel to use the new presets. Verified: frontend builds locally and UI interactions for fixture creation (manual smoke).

AgentScope tracing expects a Msg with .role and .get_content_blocks(), not a plain string. Build a rich Msg from fixture inputs (context, news_context, portfolio_state, etc.) matching the real pipeline format.

Mirror the real pipeline's _msg_to_dict logic: check metadata first, then parse JSON from text content, then try content dict. Previous version returned empty {'text': ''} when metadata was None/empty.
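
The lookup order described above (metadata, then JSON parsed from text, then a content dict) can be sketched as follows. This is a minimal illustration, not the project's `_msg_to_dict`: the function name `extract_payload` and its two plain arguments are hypothetical stand-ins for the fields of an AgentScope Msg.

```python
import json
from typing import Any


def extract_payload(metadata: Any, content: Any) -> dict:
    """Hypothetical stand-in for the benchmark extractor (not the real code)."""
    # 1. Structured metadata wins when it is a non-empty dict.
    if isinstance(metadata, dict) and metadata:
        return metadata
    # 2. Otherwise try to parse a JSON object out of plain text content.
    if isinstance(content, str):
        try:
            parsed = json.loads(content)
        except ValueError:
            parsed = None
        if isinstance(parsed, dict):
            return parsed
        # Text that is not a JSON object is kept as raw text.
        return {"text": content}
    # 3. Finally accept content that is already a dict.
    if isinstance(content, dict):
        return content
    # Nothing usable: same empty shape the commit message mentions.
    return {"text": ""}
```

With this ordering, empty or missing metadata no longer short-circuits to an empty result; the text and dict fallbacks still get a chance to produce a payload.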

Add more verbose logging to the benchmark engine and scenarios to ease debugging and analysis of GH-24 results. Verified locally: ran benchmark tasks and observed the new level of detail in the log messages.

…de priority

Favor rendering system prompts from the prompts DB for benchmark agents, while still allowing a fixture-level `system_prompt` to override when present. This fix ensures benchmark runs use the authoritative prompt source (DB) and keeps the previous behaviour of explicit fixture overrides.

Why: benchmark scenarios must reflect the same agent system prompts as the production agentscope registry so results are consistent across runs.
What: use PromptTemplateService.render(...) when no fixture override exists; add unit tests to verify DB-rendering and fixture override priority.
Tests: added unit tests covering DB rendering and override behaviour.
No breaking changes.
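
The priority rule for system prompts can be illustrated with a small sketch. The function `resolve_system_prompt` and the `render_from_db` callable are hypothetical; the callable stands in for `PromptTemplateService.render(...)`, whose real signature is not shown in this PR.

```python
from typing import Callable


def resolve_system_prompt(
    fixture: dict,
    agent_name: str,
    render_from_db: Callable[[str], str],
) -> str:
    """Pick the system prompt for a benchmark agent (illustrative only)."""
    # An explicit, non-empty fixture override takes priority.
    override = fixture.get("system_prompt")
    if override:
        return override
    # Otherwise fall back to the prompt rendered from the prompts DB.
    return render_from_db(agent_name)
```

Passing the renderer as a callable keeps the rule easy to unit test with a fake, which matches the two cases the commit says it covers: DB rendering and fixture override.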

DeepSeek models return structured JSON inside ThinkingBlocks, not TextBlocks. get_text_content() ignores thinking blocks, so the extractor was getting empty text and scoring 0.00 everywhere. New _extract_all_text_from_msg() reads ALL block types (text + thinking) before attempting JSON parsing. Added 2 tests covering thinking-only and mixed block scenarios.
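
A rough sketch of the idea, assuming content blocks are dicts with a `type` key and a `text` or `thinking` field (an assumption about the block shape, not confirmed by this PR). Unlike the `_extract_all_text_from_msg()` described above, this simplified version tries each block separately instead of collecting all text first; `first_json_payload` is a hypothetical name.

```python
import json
from typing import Optional


def first_json_payload(blocks: list[dict]) -> Optional[dict]:
    """Return the first JSON object found in text or thinking blocks."""
    for block in blocks:
        kind = block.get("type")
        if kind == "text":
            raw = block.get("text", "")
        elif kind == "thinking":
            # Thinking blocks are read too, so JSON placed there is not lost.
            raw = block.get("thinking", "")
        else:
            continue
        try:
            parsed = json.loads(raw)
        except (TypeError, ValueError):
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

The two cases mirror the scenarios the commit tests: a message with only a thinking block, and a message mixing a plain-text block with a thinking block that carries the JSON.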

Add non-blocking debug trace dumps for benchmark attempts and a run-level summary JSON so runs can be inspected without interrupting execution.

Files:
backend/app/core/config.py
backend/app/services/benchmark/scenarios.py
backend/app/services/benchmark/engine.py
docker-compose.yml
.samourai/docai/changes/2026-05/2026-05-11--GH-24--agent-benchmark-system-lot-a/chg-GH-24-plan.md
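
One way to read "non-blocking" here is that a failed dump is logged and swallowed rather than raised, so the benchmark attempt continues. The sketch below illustrates that reading only; `dump_trace`, the logger name and the file layout are hypothetical and not taken from the engine.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("benchmark.debug")


def dump_trace(directory: Path, name: str, payload: dict) -> bool:
    """Write a debug trace as JSON; never raise into the caller."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / f"{name}.json"
        # default=str keeps non-JSON values (datetimes, paths) from failing the dump.
        target.write_text(json.dumps(payload, default=str, indent=2), encoding="utf-8")
        return True
    except (OSError, TypeError, ValueError) as exc:
        # A failed dump is only logged; the benchmark attempt carries on.
        logger.warning("debug trace dump failed for %s: %s", name, exc)
        return False
```

The boolean return lets a caller count how many dumps succeeded, for example when building a run-level summary.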

simodev25 added 26 commits May 11, 2026 16:23

docs(change-spec): add spec for GH-26

f92100b

docs(plan): add plan for GH-26

69280f1

docs(test-plan): add test plan for GH-26

3c2479a

feat(GH-26): add benchmark types and frontend API client

99c9efb

feat(GH-26): add /benchmark route and sidebar navigation

b6ee55d

feat(GH-26): implement the benchmark fixtures list

eabc16f

feat(GH-26): add the benchmark run launch form

15c7f2c

feat(GH-26): display benchmark runs and V1 results

fb6ace2

feat(GH-26): add the benchmark run detail panel

0489a84

fix(GH-26): address code review iteration 1 remediation

f284c36

fix(GH-26): auto-apply missing llm_call_logs schema update in docker

ba2a8f4

feat(GH-26): add benchmark run results endpoint

bf8e243

fix(GH-26): add run polling, role gating, and enriched agent presets

6b7b222

fix(GH-26): match super-admin role format from backend

c8999ba

feat(GH-26): dynamic provider and model selection from system config

1cb5086

fix(GH-26): add benchmark queue to Celery worker

353d2d5

fix(GH-26): wrap benchmark context in AgentScope Msg object

418eb94

fix(GH-26): improve agent output extraction in benchmark scenarios

db91ca5

test(GH-26): cover the E2E benchmark flow and payload extraction

f3a7c74

chore(GH-26): add diagnostic logging for Msg structure in benchmark

5efd2e7

simodev25 wants to merge 26 commits into feat/GH-24/agent-benchmark-system-lot-a from feat/GH-26/benchmark-dashboard-frontend