feat(benchmark): GH-24 per-agent LLM model benchmarking system (Lot A: backend engine) by simodev25 · Pull Request #25 · simodev25/KairosMesh

simodev25 · 2026-05-11T14:14:32Z

Summary

Introduces a reproducible benchmarking subsystem for objectively comparing LLM models per trading agent
4 new DB tables (fixtures, runs, cases, attempts) + a purely objective V1 scoring engine (5 metrics)
Reuses the shared pipeline core (ALL_AGENT_FACTORIES, build_toolkit(), build_model()) to stay faithful to production behavior
Complete REST API under /api/v1/benchmark/ (fixture CRUD + launching/querying runs) + asynchronous Celery task
20 new unit tests, zero regressions across the 632 existing backend tests

Context (why)

Problem: Kairos Mesh has no objective mechanism for evaluating and comparing LLM model performance per agent. Model selection rests solely on developer intuition, with no reproducible comparison data. As a result, a model change cannot be justified and a performance regression cannot be detected.

Solution (Lot A, backend only): a dedicated benchmarking subsystem made up of:

Versioned fixtures that freeze inputs, prompts, skills, and tool configuration, with an integrity hash
An execution engine that reuses the shared pipeline core (agents, toolkit, model factory, formatter)
Purely objective V1 scoring (5 metrics: schema validity, completeness, tool compliance, reference consistency, stability)
A REST API following existing patterns, with filters by agent, model, and fixture
Asynchronous execution via Celery for long runs (debate, full pipeline)
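
To make the "integrity hash" on fixtures concrete, here is a minimal sketch of one way such a hash could be computed. The function name `fixture_integrity_hash`, the payload keys, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the PR does not state which algorithm or serialization fixtures_service.py uses.

```python
import hashlib
import json


def fixture_integrity_hash(payload: dict) -> str:
    """Return a SHA-256 hex digest of a fixture payload in canonical JSON form.

    Hypothetical helper: the real service may hash different fields.
    """
    # sort_keys and fixed separators make the serialization independent of
    # key insertion order and whitespace, so equal payloads hash equally.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two payloads with the same content but different key order (made-up values).
fixture_a = {
    "inputs": {"ticker": "AAPL"},
    "prompt": "Analyse the ticker.",
    "skills": ["news"],
    "tools": {"search": {"limit": 5}},
}
fixture_b = {
    "tools": {"search": {"limit": 5}},
    "skills": ["news"],
    "prompt": "Analyse the ticker.",
    "inputs": {"ticker": "AAPL"},
}

same = fixture_integrity_hash(fixture_a) == fixture_integrity_hash(fixture_b)
```

Hashing a canonical serialization means that any change to a frozen input, prompt, skill, or tool setting produces a different digest, which is what lets a run detect that a fixture was altered after it was versioned.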

Explicit non-goals (future lots):

Frontend dashboard (Lot B)
Subjective LLM-as-judge scoring (Lot C)
Advanced statistical comparison (Lot C)

Changes

Database (Alembic migration)

Migration 0013_gh24_benchmark_tables.py: 4 new tables
- benchmark_fixture: versioned fixtures with integrity hash
- benchmark_run: run metadata + model configuration
- benchmark_case: individual case within a run (agent + fixture + scenario)
- benchmark_attempt: single attempt with output, scores, and latency
Foreign keys to analysis_run (optional, for real-run context)
Uniqueness constraint: (run, agent, fixture) → a single case per triple
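
The effect of the (run, agent, fixture) uniqueness constraint can be sketched with the standard-library sqlite3 module. This is not the project's Alembic migration or SQLAlchemy model: the column names (`run_id`, `agent_name`, `fixture_id`) and the helper `add_case` are hypothetical, and only the table name and the constrained triple come from the description above.

```python
import sqlite3

# In-memory database standing in for the real one (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE benchmark_case (
        id INTEGER PRIMARY KEY,
        run_id INTEGER NOT NULL,
        agent_name TEXT NOT NULL,
        fixture_id INTEGER NOT NULL,
        UNIQUE (run_id, agent_name, fixture_id)
    )
    """
)


def add_case(run_id: int, agent_name: str, fixture_id: int) -> bool:
    """Insert a case; return False if the (run, agent, fixture) triple exists."""
    try:
        # "with conn" commits on success and rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO benchmark_case (run_id, agent_name, fixture_id) "
                "VALUES (?, ?, ?)",
                (run_id, agent_name, fixture_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The database rejected a duplicate triple.
        return False


first = add_case(1, "trader", 10)
duplicate = add_case(1, "trader", 10)
```

Enforcing the rule in the database rather than in application code means a retried or concurrent launch cannot create two cases for the same triple; repeated executions would instead be recorded as attempts under the one case.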

SQLAlchemy models

BenchmarkFixture: versioned fixture management
BenchmarkRun: metadata + provider/model/temperature
BenchmarkCase: links fixture ↔ run ↔ agent
BenchmarkAttempt: outputs + individual scores + latency

Pydantic schemas

BenchmarkFixtureCreate, BenchmarkFixtureUpdate, BenchmarkFixtureResponse
BenchmarkRunCreate, BenchmarkRunResponse, BenchmarkRunDetailResponse
BenchmarkAttemptResponse, BenchmarkCaseResponse
BenchmarkLaunchRequest (launches a run from fixtures)

Services

fixtures_service.py: fixture CRUD with integrity hash computation
runs_service.py: run CRUD, result queries with filters
engine.py: benchmark execution orchestration
- Reuses ALL_AGENT_FACTORIES, build_toolkit(), build_model(), build_formatter()
- Handles the debate skip rule
- Traces LLM calls via llm_call_log
scenarios.py: 3 scenario types
- single-agent: one agent on its own
- debate-bundle: bullish + bearish + trader (moderator)
- full-pipeline: complete 4-phase pipeline (8 agents)
scoring_v1.py: 5 objective metrics
- schema_validity: output conforms to the expected Pydantic schema
- completeness: required fields are present
- tool_policy_compliance: preset_kwargs / force_kwargs constraints are respected
- reference_consistency: references are consistent (e.g. the ticker across multi-agent outputs)
- stability: repeatability (identical outputs for the same inputs at temperature=0)
agent_output_registry.py: maps each agent to its expected Pydantic schema
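
Two of the five metrics, completeness and stability, lend themselves to a short illustration. The sketch below shows one plausible way to turn them into numbers in [0, 1]; the function signatures and the exact formulas are assumptions, not the contents of scoring_v1.py.

```python
import json


def completeness(output: dict, required_fields: list[str]) -> float:
    """Fraction of required fields that are present and not None (assumed rule)."""
    if not required_fields:
        return 1.0
    present = sum(1 for name in required_fields if output.get(name) is not None)
    return present / len(required_fields)


def stability(outputs: list[dict]) -> float:
    """Share of attempts whose canonical JSON equals the most frequent output.

    1.0 means every attempt produced the same output; this formula is an
    assumption about how repeatability could be quantified.
    """
    if not outputs:
        return 0.0
    canonical = [json.dumps(o, sort_keys=True) for o in outputs]
    most_common = max(set(canonical), key=canonical.count)
    return canonical.count(most_common) / len(canonical)


# Made-up attempt outputs for a single case.
attempts = [
    {"ticker": "AAPL", "decision": "buy"},
    {"decision": "buy", "ticker": "AAPL"},
    {"ticker": "AAPL", "decision": "hold"},
]
score_complete = completeness({"ticker": "AAPL", "decision": None}, ["ticker", "decision"])
score_stable = stability(attempts)
```

Both functions depend only on the recorded outputs, so the same attempts always yield the same scores, which is the property that distinguishes this objective V1 scoring from the LLM-as-judge scoring deferred to Lot C.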

REST API (`/api/v1/benchmark/`)

POST /benchmark/fixtures/: create a fixture
GET /benchmark/fixtures/: list fixtures (filters: agent, version)
GET /benchmark/fixtures/{id}: fixture detail
PUT /benchmark/fixtures/{id}: update a fixture
DELETE /benchmark/fixtures/{id}: delete a fixture
POST /benchmark/runs/launch: launch a run (synchronous, or async via Celery)
GET /benchmark/runs/: list runs (filters: agent, model, provider, status)
GET /benchmark/runs/{id}: run detail with attempts and scores

Celery task

benchmark_task.py: run_benchmark_task(run_id) for long-running asynchronous execution
Run status updates: pending → running → completed / failed
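
The status lifecycle above can be read as a small state machine. The following sketch encodes the four statuses and the allowed transitions; `RunStatus`, `ALLOWED_TRANSITIONS`, and `advance` are hypothetical names, and whether the actual task validates transitions this way is not stated in the PR.

```python
from enum import Enum


class RunStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# pending -> running -> completed / failed; terminal states allow nothing.
ALLOWED_TRANSITIONS: dict[RunStatus, set[RunStatus]] = {
    RunStatus.PENDING: {RunStatus.RUNNING},
    RunStatus.RUNNING: {RunStatus.COMPLETED, RunStatus.FAILED},
    RunStatus.COMPLETED: set(),
    RunStatus.FAILED: set(),
}


def advance(current: RunStatus, target: RunStatus) -> RunStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


status = advance(RunStatus.PENDING, RunStatus.RUNNING)
status = advance(status, RunStatus.COMPLETED)
```

Rejecting transitions out of a terminal state guards against a late or duplicated task execution overwriting the result of a run that has already finished.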

Infrastructure

Added BENCHMARK_MAX_ATTEMPTS_PER_CASE to config
Injected call_context into base_llm_helpers.py to trace LLM calls
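
One common way to attach a call context to log entries without threading it through every function signature is a context variable. The sketch below is an illustration only, under the assumption that something similar is done: the PR does not describe how call_context is injected in base_llm_helpers.py, and the in-memory list named `llm_call_log` merely stands in for the real log, with a hypothetical `log_llm_call` helper.

```python
import contextvars
from contextlib import contextmanager

# Holds the context of the current benchmark call, if any (hypothetical).
_call_context = contextvars.ContextVar("call_context", default=None)

# In-memory stand-in for the real llm_call_log storage.
llm_call_log: list[dict] = []


@contextmanager
def call_context(**fields):
    """Attach the given fields to every LLM call logged inside the block."""
    token = _call_context.set(dict(fields))
    try:
        yield
    finally:
        # Restore whatever context was active before entering the block.
        _call_context.reset(token)


def log_llm_call(model: str, prompt: str) -> None:
    """Record a call, merging in the active call context when one is set."""
    entry = {"model": model, "prompt_chars": len(prompt)}
    entry.update(_call_context.get() or {})
    llm_call_log.append(entry)


with call_context(context_type="benchmark", run_id=1):
    log_llm_call("example-model", "Analyse the ticker.")
log_llm_call("example-model", "Regular pipeline call.")
```

In this sketch, entries logged inside the block carry context_type="benchmark" and entries logged outside it do not, so benchmark traffic can be separated from production traffic when estimating LLM cost.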

Tests

20 new unit tests covering:
- DB models (test_benchmark_models_phase1.py)
- Pydantic schemas (test_benchmark_schema_phase2.py)
- Engine + scenarios (test_benchmark_engine_scenarios_unit.py)
- Scoring V1 (test_benchmark_scoring_v1_unit.py)
- Traceability (test_benchmark_traceability_phase4.py)
- API + services (test_benchmark_api_and_services_phase5.py)
- Celery task (test_benchmark_task_phase6.py)
Zero regressions: the 632 existing backend tests pass

Documentation

docs/development-guide.md updated with a benchmark section
Change artifacts in .samourai/docai/changes/2026-05/2026-05-11--GH-24--agent-benchmark-system-lot-a/
- Full spec (chg-GH-24-spec.md)
- Implementation plan (chg-GH-24-plan.md)
- Test plan (chg-GH-24-test-plan.md)

Linked issue

Closes #24

Tests run

cd backend && pytest -q

Result: ✅ 652 tests pass (632 existing + 20 new), 0 failures, 0 regressions

Risks

Medium risk: new subsystem touching the shared pipeline core
- Mitigation: strict reuse of ALL_AGENT_FACTORIES, build_toolkit(), build_model() without modifying the production factories
- Mitigation: benchmark code isolated under backend/app/services/benchmark/ and backend/app/api/routes/benchmark.py
- Mitigation: 20 unit tests covering all critical paths
- Mitigation: zero regressions, demonstrated by a full run of the backend test suite
LLM cost: no automatic limit
- Mitigation: max_attempts_per_case is configured manually to control call volume
- Mitigation: full call tracing via llm_call_log with context_type=benchmark
Backward compatibility: no breaking changes
- Existing tables are not modified
- The REST API follows existing patterns (/api/v1/)
- No cyclic dependencies introduced

Checklist

No secrets added
Relevant tests pass (20 new + 632 existing)
Documentation updated (docs/development-guide.md)
Breaking changes: none. This is a purely additive capability with zero impact on the existing trading pipeline

Phase 1 for GH-24: add backend scaffolding including Alembic migration, ORM models, API route skeletons and unit tests to start the agent-benchmark system. Verification: unit tests added for models; alembic revision included; basic routing and services scaffolded for further implementation. Refs: GH-24

…p backend version Reconciliation of plan vs implementation (execution log added), update development documentation to include the new benchmark subsystem, and bump backend app version to 0.2.0 (minor) to reflect delivered Lot A changes. Verification: updated plan shows PASS/PARTIAL criteria and targeted unit tests for benchmark features reported as passing.

simodev25 added 5 commits May 11, 2026 15:14

docs(change-spec): add spec for GH-24

1b328fd

docs(change-plan): add implementation plan for GH-24

305f488

docs(change-test-plan): add test plan for GH-24

72743aa

simodev25 mentioned this pull request May 11, 2026

feat(GH-26): Agent Benchmark System — Lot B : Dashboard frontend #27

Open

4 tasks

simodev25 wants to merge 5 commits into main from feat/GH-24/agent-benchmark-system-lot-a
