feat(ai): ContinuousEvalWorker — scheduled prod evals (Phase 12, AI-079 slice 5a)#370

Merged: mrviduus merged 1 commit into main from phase12-continuous-eval on Jun 18, 2026

@mrviduus (Owner)

Phase 12 (RLOps) — slice 5a: continuous eval (first half of the Phase-12 closer)

Automates the eval suite on a cadence so quality regressions on prod are caught without an admin clicking "run evals".

  • ContinuousEvalWorker (a BackgroundService in the Api host): waits ~10 min after startup, then checks hourly. A run is due when no eval_runs row with run_type='scheduled' is newer than Eval:Scheduled:IntervalHours (default 24h). When due, it runs EvalSuiteRunner for the configured features and persists the results with the new eval_runs.run_type column (scheduled/manual).
  • Regression alert: emails the admin (via ResendEmailService, a no-op if unset) when a feature's score drops by ≥ RegressionDrop (default 0.5 on the 1-5 scale) versus the prior scheduled run. The comparison lives in EvalRegressionDetector, which is pure and unit-tested.
  • OFF by default (Eval:Scheduled:Enabled=false): it incurs judge cost when on, so the worker also respects an optional eval.judge daily cap (fail-open).
  • Concurrency-safe: the in-process overlap guard previously used by the admin trigger is extracted into a shared singleton IEvalRunGate, so scheduled and admin runs can't collide, and a Postgres advisory lock covers multi-replica deployments. The worker never crashes the host.
  • New GET /admin/ai-quality/drift/eval-trend endpoint exposes the scheduled-only score trend (consumed by the Drift tab in slice 5b).
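
The due check and the regression rule described in the list above can be sketched as follows. This is a language-neutral illustration in Python, not the PR's C# code: `is_due`, `find_regressions`, and the score dictionaries are hypothetical names chosen for the example. Only the 24h interval default and the 0.5 drop threshold on the 1-5 scale come from the description.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_due(last_scheduled_at: Optional[datetime],
           now: datetime,
           interval_hours: float = 24.0) -> bool:
    """A scheduled run is due when no scheduled run falls inside the interval."""
    if last_scheduled_at is None:
        return True  # no scheduled run has ever been recorded
    return now - last_scheduled_at >= timedelta(hours=interval_hours)


def find_regressions(previous: dict[str, float],
                     current: dict[str, float],
                     regression_drop: float = 0.5) -> dict[str, float]:
    """Return {feature: drop} for features that fell by at least regression_drop.

    Pure function: no I/O, so it can be unit-tested in isolation.
    """
    regressions: dict[str, float] = {}
    for feature, score in current.items():
        prior = previous.get(feature)
        if prior is None:
            continue  # no prior scheduled run to compare against
        drop = prior - score
        if drop >= regression_drop:
            regressions[feature] = drop
    return regressions
```

Keeping the comparison free of I/O is what makes it cheap to test: the worker can fetch the two most recent scheduled runs, pass their per-feature scores in, and email only when the returned mapping is non-empty.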

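One plausible reading of the "fail-open" daily cap is sketched below in Python (again illustrative, not the PR's C# code). The function name `judge_budget_allows` and the `get_spend_today` callback are assumptions made for the example, as is the interpretation that a missing cap or a failed spend lookup lets the run proceed.

```python
from typing import Callable, Optional


def judge_budget_allows(get_spend_today: Callable[[], float],
                        daily_cap: Optional[float]) -> bool:
    """Decide whether a scheduled eval may spend more on the judge today."""
    if daily_cap is None:
        return True  # the cap is optional; no cap configured means no limit
    try:
        spent = get_spend_today()
    except Exception:
        # Fail-open (assumed meaning): if spend can't be read, do not block evals.
        return True
    return spent < daily_cap
```

The trade-off of failing open is that a broken spend lookup can let a run through that the cap would have stopped; the alternative, failing closed, would silently stop scheduled evals whenever the lookup breaks.
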
Safety / QA

Pipeline: architect → backend → adversarial QA (verdict: SHIP). Two fixes were applied:

  • P1: the advisory lock is released with CancellationToken.None, so host shutdown can't leak it onto a pooled connection.
  • P2: the admin trigger releases the gate even on a synchronous setup failure, so a failed setup can't permanently block evals.

Migration AddEvalRunType adds the column as NOT NULL with default 'manual', which backfills existing rows. 780 unit tests green; full solution clean. OFF by default, so zero behavior change on deploy.
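
The P2 fix comes down to releasing the gate on every exit path, including a failure that happens before the run itself starts. A minimal Python sketch of that pattern follows. `EvalRunGate`, `try_enter`, `release`, and `run_guarded` are hypothetical names for illustration; the real IEvalRunGate is a C# singleton whose interface is not shown in this PR description.

```python
import threading
from typing import Any, Callable


class EvalRunGate:
    """In-process guard allowing at most one eval run at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        # Non-blocking: a second caller is turned away instead of queued.
        return self._lock.acquire(blocking=False)

    def release(self) -> None:
        self._lock.release()


def run_guarded(gate: EvalRunGate,
                setup: Callable[[], Any],
                run: Callable[[Any], None]) -> str:
    """Run an eval under the gate, releasing it even if setup raises."""
    if not gate.try_enter():
        return "skipped"  # another scheduled or admin run is in progress
    try:
        context = setup()  # may raise synchronously, before any work starts
        run(context)
        return "completed"
    finally:
        gate.release()  # without this, a setup failure would block all later runs
```

Placing the setup call inside the `try` is the important detail: if it sat between `try_enter()` and the `try`, an exception there would leave the gate held for the life of the process.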

Slice 5b (DriftDetectionWorker + drift table + Drift tab) follows and closes Phase 12.

🤖 Generated with Claude Code

…12 slice 5a)

Automates the eval suite on a cadence so quality regressions are caught
without an admin trigger.

- ContinuousEvalWorker (Api BackgroundService, ~10min startup + hourly check):
  runs EvalSuiteRunner when due (no scheduled run newer than
  Eval:Scheduled:IntervalHours, default 24h), persists with new run_type column,
  alerts admin on score drop >= RegressionDrop (default 0.5) vs prior scheduled
  run (EvalRegressionDetector, pure).
- OFF by default (Eval:Scheduled:Enabled=false); respects optional eval.judge
  daily cap (fail-open).
- Concurrency: shared IEvalRunGate singleton (extracted from the admin trigger's
  static guard) + Postgres advisory lock for multi-replica. Never crashes host.
- QA: P1 advisory-unlock uses CancellationToken.None (no leak on shutdown);
  P2 admin trigger releases gate on synchronous setup failure.
- GET /admin/ai-quality/drift/eval-trend (scheduled trend, for 5b Drift tab).
- Migration AddEvalRunType (backfills existing rows to 'manual').

architect -> backend -> adversarial QA (SHIP; P1+P2 fixed). 780 unit tests
green; solution clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mrviduus mrviduus merged commit 530c5b9 into main Jun 18, 2026
5 checks passed
@mrviduus mrviduus deleted the phase12-continuous-eval branch June 18, 2026 17:41