feat(ai): ContinuousEvalWorker — scheduled prod evals (Phase 12, AI-079 slice 5a) by mrviduus · Pull Request #370 · mrviduus/textstack

mrviduus · 2026-06-18T17:30:40Z

Phase 12 (RLOps) — slice 5a: continuous eval (first half of the Phase-12 closer)

Automates the eval suite on a cadence so quality regressions on prod are caught without an admin clicking "run evals".

- ContinuousEvalWorker (Api host BackgroundService, ~10 min startup delay, then an hourly check): runs EvalSuiteRunner for the configured features when a run is due (no run_type='scheduled' row newer than Eval:Scheduled:IntervalHours, default 24h) and persists the result with the new eval_runs.run_type column (scheduled/manual).
- Regression alert: emails the admin (via ResendEmailService, no-op if unset) when a feature's score drops by ≥ RegressionDrop (default 0.5 on the 1-5 scale) versus the prior scheduled run (EvalRegressionDetector, pure and unit-tested).
- OFF by default (Eval:Scheduled:Enabled=false): it spends judge budget when on, so it also respects an optional eval.judge daily cap (fail-open).
- Concurrency-safe: the in-process overlap guard that the admin trigger used is extracted into a shared singleton IEvalRunGate, so scheduled and admin runs can't collide, and a Postgres advisory lock covers multi-replica deployments. The worker never crashes the host.
- New GET /admin/ai-quality/drift/eval-trend endpoint exposes the scheduled-only score trend (consumed by the Drift tab in slice 5b).
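
For reviewers who want the due check and the regression rule in concrete form, here is a minimal sketch. It is written in Python as a language-neutral illustration and is not the C# code in this PR. The function names `is_due` and `detect_regressions`, the dictionary shapes, and the feature names in the usage note are all hypothetical. It also assumes that a feature with no prior scheduled score is skipped rather than alerted on, which the description does not state.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due(last_scheduled_at: Optional[datetime], now: datetime,
           interval_hours: float = 24.0) -> bool:
    """True when no scheduled run exists, or the newest one is at least
    interval_hours old (i.e. no scheduled row is newer than the interval)."""
    if last_scheduled_at is None:
        return True
    return now - last_scheduled_at >= timedelta(hours=interval_hours)


def detect_regressions(previous: dict[str, float], current: dict[str, float],
                       regression_drop: float = 0.5) -> dict[str, float]:
    """Return {feature: drop} for each feature whose score fell by at least
    regression_drop compared with the prior scheduled run. Pure function."""
    regressions: dict[str, float] = {}
    for feature, score in current.items():
        prior = previous.get(feature)
        if prior is None:
            # Assumption: nothing to compare against, so no alert.
            continue
        drop = prior - score
        if drop >= regression_drop:
            regressions[feature] = drop
    return regressions
```

With a prior run of `{"a": 4.5, "b": 4.0}` and a current run of `{"a": 4.0, "b": 3.75}`, only `a` is reported, because its drop of 0.5 meets the default threshold while the 0.25 drop on `b` does not. Keeping the comparison free of I/O is what lets a detector like this be unit-tested without a database or an email service.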

Safety / QA

Review flow: architect → backend → adversarial QA (verdict SHIP). Two fixes were applied. P1: the advisory lock is released with CancellationToken.None, so host shutdown can't leak it onto a pooled connection. P2: the admin trigger releases the gate even on a synchronous setup failure, so a failed setup can't block evals permanently. Migration AddEvalRunType adds the column as NOT NULL with default 'manual', which backfills existing rows. 780 unit tests green; full solution clean. OFF by default, so zero behavior change on deploy.
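
The P2 fix comes down to releasing the gate on every exit path, including a failure that happens before the run itself starts. The sketch below shows that pattern in Python as a language-neutral illustration. It is not the PR's C# IEvalRunGate. The class `EvalRunGate`, its `try_enter`/`exit` methods, and the `run_guarded` helper are hypothetical names, and the Postgres advisory lock that guards multi-replica deployments is deliberately left out.

```python
import threading
from typing import Any, Callable


class EvalRunGate:
    """Hypothetical in-process gate: at most one eval run at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        # Non-blocking: a second caller is rejected rather than queued.
        return self._lock.acquire(blocking=False)

    def exit(self) -> None:
        self._lock.release()


def run_guarded(gate: EvalRunGate,
                setup: Callable[[], Any],
                run: Callable[[Any], None]) -> str:
    """Run setup() and then run(ctx) under the gate, always releasing it."""
    if not gate.try_enter():
        return "skipped"  # another run (scheduled or admin) holds the gate
    try:
        ctx = setup()  # may raise synchronously
        run(ctx)
        return "completed"
    finally:
        gate.exit()  # released on success and on any failure
```

If `setup` raises, the exception still propagates to the caller, but the `finally` block has already released the gate, so the next scheduled or admin run is not blocked.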

Slice 5b (DriftDetectionWorker + drift table + Drift tab) follows and closes Phase 12.

🤖 Generated with Claude Code

…12 slice 5a)

Automates the eval suite on a cadence so quality regressions are caught without an admin trigger.

- ContinuousEvalWorker (Api BackgroundService, ~10min startup + hourly check): runs EvalSuiteRunner when due (no scheduled run newer than Eval:Scheduled:IntervalHours, default 24h), persists with new run_type column, alerts admin on score drop >= RegressionDrop (default 0.5) vs prior scheduled run (EvalRegressionDetector, pure).
- OFF by default (Eval:Scheduled:Enabled=false); respects optional eval.judge daily cap (fail-open).
- Concurrency: shared IEvalRunGate singleton (extracted from the admin trigger's static guard) + Postgres advisory lock for multi-replica. Never crashes host.
- QA: P1 advisory-unlock uses CancellationToken.None (no leak on shutdown); P2 admin trigger releases gate on synchronous setup failure.
- GET /admin/ai-quality/drift/eval-trend (scheduled trend, for 5b Drift tab).
- Migration AddEvalRunType (backfills existing rows to 'manual').

architect -> backend -> adversarial QA (SHIP; P1+P2 fixed). 780 unit tests green; solution clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mrviduus merged commit 530c5b9 into main Jun 18, 2026
5 checks passed

mrviduus deleted the phase12-continuous-eval branch June 18, 2026 17:41

mrviduus merged 1 commit into main from phase12-continuous-eval
