feat(GH-26): Agent Benchmark System — Lot B : Dashboard frontend#27
Open
simodev25 wants to merge 26 commits into
Add a UI panel and page-state hooks to create benchmark fixtures, wire the frontend API client (createBenchmarkFixture) and include the change plan. This commit implements the form, state management, types and API call used to POST /benchmark/fixtures from the frontend.
Add realistic, per-agent input presets to the benchmark fixture creator so that tests and manual runs better reflect production agent inputs; update the benchmark page and creation panel to use the new presets. Verified: the frontend builds locally and the fixture-creation UI interactions pass a manual smoke test.
AgentScope tracing expects a Msg with .role and .get_content_blocks(), not a plain string. Build a rich Msg from fixture inputs (context, news_context, portfolio_state, etc.) matching the real pipeline format.
Mirror the real pipeline's _msg_to_dict logic: check metadata first,
then parse JSON from text content, then try content dict. Previous
version returned empty {'text': ''} when metadata was None/empty.
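The fallback order described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual `_msg_to_dict`: the function name `msg_to_dict`, its two-argument signature, and the choice to wrap non-JSON text as `{'text': ...}` are assumptions made for the example.

```python
import json
from typing import Any


def msg_to_dict(metadata: Any, content: Any) -> dict:
    """Hypothetical stand-in illustrating the fallback order only."""
    # 1. Non-empty metadata takes priority.
    if isinstance(metadata, dict) and metadata:
        return metadata
    # 2. Otherwise try to parse a JSON object out of text content.
    if isinstance(content, str):
        try:
            parsed = json.loads(content)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            return parsed
        # Assumption: plain text that is not JSON is kept under "text".
        return {"text": content}
    # 3. Finally, accept a content dict as-is.
    if isinstance(content, dict):
        return content
    # Nothing usable was found.
    return {"text": ""}
```

With this ordering, a message whose metadata is `None` or `{}` still yields its parsed JSON payload instead of an empty `{'text': ''}`.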
Add more verbose logs to the benchmark engine and scenarios to make debugging and analysis of GH-24 results easier. Verified locally: ran the benchmark tasks and observed the new level of detail in the log messages.
…de priority
Favor rendering system prompts from the prompts DB for benchmark agents, while still allowing a fixture-level `system_prompt` to override when present. This fix ensures benchmark runs use the authoritative prompt source (DB) and keeps the previous behaviour of explicit fixture overrides.
Why: benchmark scenarios must reflect the same agent system prompts as the production agentscope registry so results are consistent across runs.
What: use PromptTemplateService.render(...) when no fixture override exists; add unit tests to verify DB-rendering and fixture override priority.
Tests: added unit tests covering DB rendering and override behaviour. No breaking changes.
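A minimal sketch of the priority rule, under stated assumptions: `resolve_system_prompt` is a hypothetical helper, and `render_from_db` is a plain callable standing in for `PromptTemplateService.render(...)`, whose real signature is not shown in this PR.

```python
from typing import Callable, Mapping


def resolve_system_prompt(
    fixture: Mapping[str, object],
    render_from_db: Callable[[str], str],
    agent_name: str,
) -> str:
    """Return the fixture override if present, else the DB-rendered prompt."""
    override = fixture.get("system_prompt")
    # An explicit, non-blank fixture-level prompt keeps priority.
    if isinstance(override, str) and override.strip():
        return override
    # Assumption: the DB renderer only needs the agent name.
    return render_from_db(agent_name)
```

Passing the renderer in as a callable keeps the rule easy to unit test without a database, which matches the two cases the commit says it covers (DB rendering and fixture override).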
DeepSeek models return structured JSON inside ThinkingBlocks, not TextBlocks. get_text_content() ignores thinking blocks, so the extractor was getting empty text and scoring 0.00 everywhere. New _extract_all_text_from_msg() reads ALL block types (text + thinking) before attempting JSON parsing. Added 2 tests covering thinking-only and mixed block scenarios.
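The idea behind reading every block type can be sketched as below. This is not the PR's `_extract_all_text_from_msg()`; it assumes content blocks are plain dicts with a `type` key and the payload under `text` or `thinking`, and both function names are hypothetical.

```python
import json
from typing import Any, Iterable, Mapping, Optional


def extract_all_text(blocks: Iterable[Mapping[str, Any]]) -> str:
    """Concatenate the payload of text AND thinking blocks, in order."""
    parts = []
    for block in blocks:
        kind = block.get("type")
        if kind == "text":
            parts.append(str(block.get("text", "")))
        elif kind == "thinking":
            # Assumption: thinking blocks keep their payload under "thinking".
            parts.append(str(block.get("thinking", "")))
        # Other block types (images, tool calls, ...) are skipped.
    return "\n".join(p for p in parts if p)


def first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in text, if any."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None
```

A thinking-only message such as `[{"type": "thinking", "thinking": '{"signal": "buy"}'}]` then produces a parsed dict rather than empty text, which is the case that previously scored 0.00.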
Add non-blocking debug trace dumps for benchmark attempts and a run-level summary JSON so runs can be inspected without interrupting execution.
Files:
- backend/app/core/config.py
- backend/app/services/benchmark/scenarios.py
- backend/app/services/benchmark/engine.py
- docker-compose.yml
- .samourai/docai/changes/2026-05/2026-05-11--GH-24--agent-benchmark-system-lot-a/chg-GH-24-plan.md
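One way to read "non-blocking" here is that a failed dump is logged and swallowed rather than raised into the benchmark run. The sketch below illustrates that reading only; `dump_debug_trace`, its parameters, and the one-file-per-name layout are assumptions, not the engine's actual code.

```python
import json
import logging
from pathlib import Path
from typing import Any

logger = logging.getLogger(__name__)


def dump_debug_trace(directory: Path, name: str, payload: Any) -> bool:
    """Write payload as JSON; never raise into the caller on failure."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
        path = directory / f"{name}.json"
        # default=str keeps non-serializable values (datetimes, paths) from failing the dump.
        path.write_text(json.dumps(payload, default=str, indent=2), encoding="utf-8")
        return True
    except (OSError, TypeError, ValueError) as exc:
        # The benchmark attempt continues even if the trace cannot be written.
        logger.warning("debug trace dump failed for %s: %s", name, exc)
        return False
```

Returning a boolean instead of raising lets the caller record whether a trace exists without adding error handling around every attempt.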
Summary
React frontend dashboard for the trading-agent benchmark system (Lot B). Builds on the REST API delivered by Lot A (GH-24, PR #25).
- `/benchmark` integrated into the existing SPA with sidebar navigation
Changes
- frontend/src/types/benchmark.ts: TypeScript types (fixtures, runs, cases, attempts, V1 scores)
- frontend/src/api/client.ts: 6 benchmark API methods added
- frontend/src/pages/BenchmarkPage.tsx: full page with 5 views (fixtures, launch, results, comparison, run detail)
- frontend/src/components/Layout.tsx: BENCHMARK sidebar navigation entry
- frontend/src/App.tsx: `/benchmark` route with lazy loading
- .samourai/docai/changes/: spec, plan, test plan, pm-notes
Related issue
Closes #26
Tests run
Note: pre-existing TS errors in BacktestsPage, ConnectorsPage, OrdersPage, RunDetailPage are out of scope.
Risks
- feat/GH-24 (not main): merge Agent Benchmark System, Lot A: backend engine (fixtures, engine, scoring, API) #24 (PR feat(benchmark): GH-24 LLM model benchmarking system per agent (Lot A, backend engine) #25) first
Checklist