Background
Spinning out of #162 (Apple Music database service). Today, when the chained refresh job (`apple/refresh_catalog` × 3 → `apple/rebuild_index`) fails, the only signal is:
- A `failed` or `dead` row in `research_jobs`
- A line on the `/admin/apple-music-catalog` "Recent Refresh Activity" panel
- An exception trace in the worker's logs
That means failures are silent until someone happens to load the admin page or grep through Render logs. With #178 landing, scheduled runs make this worse — a 3 AM failure could go unnoticed for a week.
This same gap exists for any worker job, not just Apple catalog refresh — but Apple catalog refresh is the most prominent case because the failure mode is "matchers run against stale data, silently degrade quality."
Goal
Notify a human when:
- A refresh-chain run fails (any of the four steps: download albums / songs / artists / rebuild_index)
- A run gets marked `dead` by the janitor (stuck past `STUCK_RUNNING_AFTER`)
- (Optional / later) A scheduled run is overdue — e.g., no successful refresh in the last 14 days when one was expected weekly
Options
Channels — pick one or more:
- Email via SendGrid — credentials already configured (`SENDGRID_API_KEY`), template-friendly, works with any email recipient. Best fit for a low-frequency signal like this.
- Slack webhook — needs a new env var (`SLACK_WEBHOOK_URL`), more interactive (can include a "Retry" button via slash command). Overkill if it's a one-person project.
- Render notifications — Render has a built-in failure-notification system at the service level, but it fires on service crashes, not on individual job failures inside a long-running worker. Not a fit.
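Since SendGrid is the channel with credentials already in place, here is a minimal sketch of the send path using SendGrid's v3 `POST /v3/mail/send` REST endpoint directly. The recipient and from addresses are placeholders, and a real implementation might prefer the official `sendgrid` client library:

```python
import json
import os
import urllib.request


def build_sendgrid_payload(subject: str, body: str, to_email: str, from_email: str) -> dict:
    """Build the JSON body expected by SendGrid's v3 mail/send endpoint."""
    return {
        "personalizations": [{"to": [{"email": to_email}]}],
        "from": {"email": from_email},
        "subject": subject,
        "content": [{"type": "text/plain", "value": body}],
    }


def send_digest_email(subject: str, body: str, to_email: str, from_email: str) -> None:
    """POST the digest to SendGrid using the already-configured SENDGRID_API_KEY."""
    req = urllib.request.Request(
        "https://api.sendgrid.com/v3/mail/send",
        data=json.dumps(build_sendgrid_payload(subject, body, to_email, from_email)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['SENDGRID_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # SendGrid responds 202 Accepted on success
```

Keeping payload construction separate from the HTTP call makes the formatting easy to unit-test without a network round trip.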
Trigger points — where the alert fires from:
- Inside `research_worker/loop.py` when a job's status transitions to `failed` (final attempt) or `dead` (janitor-marked)
- Periodic check via the same scheduler that fires the catalog refresh — e.g. once per hour, scan `research_jobs` for `status IN ('failed', 'dead') AND created_at > NOW() - INTERVAL '1 hour'` and digest into one email
- A Postgres NOTIFY/LISTEN trigger that fires on row updates — overengineered for this scope
Recommendation: trigger #2 (periodic digest). Avoids storms during retry loops, batches multiple failures into one notification, easy to suppress duplicates.
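A sketch of what the periodic digest could look like, assuming dict-shaped `research_jobs` rows and a `payload.chain_started_at` field for grouping failures from the same chain run (column names follow this issue; the row shape is otherwise hypothetical):

```python
from collections import defaultdict
from typing import Iterable

# Trigger #2: an hourly scan for rows that newly reached a terminal bad state.
DIGEST_QUERY = """
SELECT id, source, job_type, target_id, status, error, payload
FROM research_jobs
WHERE status IN ('failed', 'dead')
  AND created_at > NOW() - INTERVAL '1 hour'
"""


def group_failures(rows: Iterable[dict]) -> dict:
    """Group failed/dead rows by chain run, so one broken chain yields one section.

    Falls back to the row id when payload.chain_started_at is absent.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row.get("payload") or {}).get("chain_started_at") or f"job-{row['id']}"
        groups[key].append(row)
    return dict(groups)


def format_digest(groups: dict, admin_url: str) -> str:
    """Render one plain-text digest body covering every group in a single email."""
    lines = []
    for key, rows in sorted(groups.items()):
        lines.append(f"Chain {key}: {len(rows)} failed/dead job(s)")
        for row in rows:
            lines.append(
                f"  - {row['source']}/{row['job_type']} "
                f"target={row['target_id']} status={row['status']}: {row['error']}"
            )
    lines.append(f"Details: {admin_url}")
    return "\n".join(lines)
```

Batching everything the scan finds into one `format_digest` call is what prevents the notification storm a per-retry trigger would produce.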
Tasks
- Add a `core/job_alerting.py` module with a single entry point, `send_failure_digest()`. It should know:
  - which `research_jobs` rows transitioned to `failed`/`dead` since the last digest
  - how to group failures from the same chain run (use `payload.chain_started_at` to group)
- Schedule the digest: either as a `research_jobs` row of its own (meta!), or as a Render cron alongside the scheduled refresh
- Suppress duplicates: `last_alerted_at` on the row, or a simple "high-water-mark" timestamp in a config table
- Skip `pending`/`running`/`succeeded`/`canceled`: alert only on `failed` (after the final retry) and `dead`
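The high-water-mark variant of duplicate suppression can be sketched as follows, using in-memory SQLite as a stand-in for Postgres (the `alert_config` table name and its key are made up for illustration):

```python
import sqlite3

# In-memory SQLite stands in for Postgres; research_jobs columns follow this issue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE research_jobs (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT);
CREATE TABLE alert_config (key TEXT PRIMARY KEY, value TEXT);
INSERT INTO alert_config VALUES ('digest_high_water_mark', '2024-06-01T00:00:00Z');
INSERT INTO research_jobs VALUES
  (1, 'failed',    '2024-06-01T03:05:00Z'),  -- new since last digest: alert
  (2, 'dead',      '2024-05-31T22:00:00Z'),  -- covered by a prior digest: silent
  (3, 'succeeded', '2024-06-01T03:10:00Z');  -- never alerted on
""")


def rows_for_digest(conn) -> list:
    """Fetch failed/dead rows newer than the high-water mark, then advance the mark."""
    (mark,) = conn.execute(
        "SELECT value FROM alert_config WHERE key = 'digest_high_water_mark'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, status, created_at FROM research_jobs "
        "WHERE status IN ('failed', 'dead') AND created_at > ? ORDER BY created_at",
        (mark,),
    ).fetchall()
    if rows:
        conn.execute(
            "UPDATE alert_config SET value = ? WHERE key = 'digest_high_water_mark'",
            (rows[-1][2],),
        )
    return rows
```

Because the mark only advances when rows are found, a digest run that finds nothing new returns empty and stays silent, which is exactly the acceptance criterion below.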
Acceptance
- A `failed` or `dead` `research_jobs` row produces exactly one notification within the digest window
- No notification for transient retries — only after `attempts >= max_attempts`
- Notification includes: source (`apple`), job_type (`refresh_catalog`/`rebuild_index`), `target_id`, error message, link to the admin page
- Duplicate notifications suppressed (a digest run that finds nothing new is silent)
Out of scope
- "Stale catalog" alerting (separate signal: no successful refresh in N days). Easy to add later in the same module — flag for now.
- Per-job-type custom routing (e.g. "Apple failures go to one channel, Spotify to another"). One channel for all is fine until proven otherwise.
- A general-purpose monitoring system. We're solving the specific "refresh failed and nobody knows" problem; this isn't a Datadog substitute.