feat(ai): ContinuousEvalWorker — scheduled prod evals (Phase 12, AI-079 slice 5a)#370
Merged
…12 slice 5a)

Automates the eval suite on a cadence so quality regressions are caught without an admin trigger.

- ContinuousEvalWorker (Api BackgroundService, ~10min startup + hourly check): runs EvalSuiteRunner when due (no scheduled run newer than Eval:Scheduled:IntervalHours, default 24h), persists with new run_type column, alerts admin on score drop >= RegressionDrop (default 0.5) vs prior scheduled run (EvalRegressionDetector, pure).
- OFF by default (Eval:Scheduled:Enabled=false); respects optional eval.judge daily cap (fail-open).
- Concurrency: shared IEvalRunGate singleton (extracted from the admin trigger's static guard) + Postgres advisory lock for multi-replica. Never crashes host.
- QA: P1 advisory-unlock uses CancellationToken.None (no leak on shutdown); P2 admin trigger releases gate on synchronous setup failure.
- GET /admin/ai-quality/drift/eval-trend (scheduled trend, for 5b Drift tab).
- Migration AddEvalRunType (backfills existing rows to 'manual').

architect -> backend -> adversarial QA (SHIP; P1+P2 fixed). 780 unit tests green; solution clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 12 (RLOps) — slice 5a: continuous eval (first half of the Phase-12 closer)
Automates the eval suite on a cadence so quality regressions on prod are caught without an admin clicking "run evals".
- ContinuousEvalWorker (Api host BackgroundService, ~10min startup delay + hourly check): runs EvalSuiteRunner for the configured features when due (no run_type='scheduled' row newer than Eval:Scheduled:IntervalHours, default 24h), persists with the new eval_runs.run_type column (scheduled/manual).
- Regression alert: alerts admin (ResendEmailService, no-op if unset) when a feature's score drops ≥ RegressionDrop (default 0.5 on the 1-5 scale) vs the prior scheduled run (EvalRegressionDetector, pure/unit-tested).
- OFF by default (Eval:Scheduled:Enabled=false): it spends judge $ when on, so it also respects an optional eval.judge daily cap (fail-open).
- Concurrency: shared IEvalRunGate (scheduled + admin runs can't collide) + a Postgres advisory lock for multi-replica. Worker never crashes the host.
- GET /admin/ai-quality/drift/eval-trend exposes the scheduled-only score trend (consumed by the Drift tab in slice 5b).

Safety / QA
architect → backend → adversarial QA (verdict SHIP). Two fixes applied: P1: the advisory lock is released with CancellationToken.None so host shutdown can't leak it onto a pooled connection; P2: the admin trigger releases the gate even on a synchronous setup failure (no permanent eval block). Migration AddEvalRunType (NOT NULL default 'manual', backfills existing rows). 780 unit tests green; full solution clean. OFF by default → zero behavior change on deploy.

Slice 5b (DriftDetectionWorker + drift table + Drift tab) follows and closes Phase 12.
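
For reviewers who want the scheduling and alerting rules in one place, here is a minimal, language-neutral sketch in Python. It is not the PR's C# code. It only restates the two decisions described above: a run is due when the feature is enabled and no scheduled run is newer than the interval, and a feature counts as regressed when its score fell by at least RegressionDrop against the prior scheduled run. The names `ScheduledConfig`, `is_due`, and `regressed_features`, and the feature names in the usage note, are hypothetical. Treating a feature with no prior scheduled score as "not regressed" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ScheduledConfig:
    # Defaults mirror the values quoted in this PR; the field names are hypothetical.
    enabled: bool = False          # Eval:Scheduled:Enabled
    interval_hours: float = 24.0   # Eval:Scheduled:IntervalHours
    regression_drop: float = 0.5   # RegressionDrop, on the 1-5 judge scale


def is_due(cfg: ScheduledConfig,
           last_scheduled_at: Optional[datetime],
           now: datetime) -> bool:
    """Due when enabled and no scheduled run is newer than the interval."""
    if not cfg.enabled:
        return False
    if last_scheduled_at is None:
        return True  # no scheduled run has ever been recorded
    return now - last_scheduled_at >= timedelta(hours=cfg.interval_hours)


def regressed_features(previous: Dict[str, float],
                       current: Dict[str, float],
                       drop: float) -> Dict[str, Tuple[float, float]]:
    """Return {feature: (previous, current)} where the score fell by >= drop."""
    flagged = {}
    for feature, curr in current.items():
        prev = previous.get(feature)
        # Assumption: without a prior scheduled score there is nothing to compare.
        if prev is not None and prev - curr >= drop:
            flagged[feature] = (prev, curr)
    return flagged
```

With the default drop of 0.5, `regressed_features({"chat": 4.0, "search": 4.0}, {"chat": 3.5, "search": 3.75}, 0.5)` flags only `chat`, since `search` fell by 0.25. Both functions are pure, so they can be unit-tested without a database or a clock.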
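
The P2 fix above (the gate is released even when setup fails synchronously) comes down to acquiring the gate first and putting every later step, setup included, inside a try/finally. The following Python sketch illustrates that pattern under stated assumptions. `EvalRunGate`, `run_guarded`, and the `"skipped"` return value are hypothetical stand-ins, not the PR's IEvalRunGate API, and the Postgres advisory lock and its release are not modelled here.

```python
import threading
from typing import Callable, TypeVar

S = TypeVar("S")


class EvalRunGate:
    """Hypothetical in-process gate: at most one eval run at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        # Non-blocking: a second caller is refused rather than queued.
        return self._lock.acquire(blocking=False)

    def release(self) -> None:
        self._lock.release()


def run_guarded(gate: EvalRunGate,
                setup: Callable[[], S],
                run: Callable[[S], str]) -> str:
    """Run setup() then run(state) while holding the gate.

    setup() is called inside the try block, so an exception raised there
    still reaches the finally clause and the gate is released.
    """
    if not gate.try_enter():
        return "skipped"  # another run (scheduled or admin) holds the gate
    try:
        state = setup()
        return run(state)
    finally:
        gate.release()
```

If `setup()` were called before the `try`, an exception there would leave the gate held and block every later run, which is the failure mode P2 describes.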
🤖 Generated with Claude Code