
Failure alerting for research_worker jobs (esp. Apple Music catalog refresh) #179

@dprodger

Description


Background

Spinning out of #162 (Apple Music database service). Today, when the chained refresh job (apple/refresh_catalog × 3 → apple/rebuild_index) fails, the only signal is:

  • A failed or dead row in research_jobs
  • A line on the /admin/apple-music-catalog "Recent Refresh Activity" panel
  • An exception trace in the worker's logs

That means failures are silent until someone happens to load the admin page or grep through Render logs. With #178 landing, scheduled runs make this worse — a 3 AM failure could go unnoticed for a week.

This same gap exists for any worker job, not just Apple catalog refresh — but Apple catalog refresh is the most prominent case because the failure mode is "matchers run against stale data, silently degrade quality."

Goal

Notify a human when:

  1. A refresh-chain run fails (any of the four steps: download albums / songs / artists / rebuild_index)
  2. A run gets marked dead by the janitor (stuck past STUCK_RUNNING_AFTER)
  3. (Optional / later) A scheduled run is overdue — e.g., no successful refresh in the last 14 days when one was expected weekly
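The first two conditions reduce to a small predicate over a job row. A minimal sketch — the `JobRow` / `should_alert` names are hypothetical, and the `attempts` / `max_attempts` fields are assumed from the "only after attempts >= max_attempts" acceptance rule:

```python
from dataclasses import dataclass

@dataclass
class JobRow:
    status: str        # pending / running / succeeded / failed / dead / canceled
    attempts: int
    max_attempts: int

def should_alert(job: JobRow) -> bool:
    """Alert only on terminal failures: a final-attempt 'failed' or a
    janitor-marked 'dead'. Transient retries stay silent."""
    if job.status == "dead":
        return True
    if job.status == "failed" and job.attempts >= job.max_attempts:
        return True
    return False
```

The overdue-run check (condition 3) is deliberately left out, matching its "optional / later" status.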

Options

Channels — pick one or more:

  • Email via SendGrid — credentials already configured (SENDGRID_API_KEY), template-friendly, works with any email recipient. Best fit for a low-frequency signal like this.
  • Slack webhook — needs a new env var (SLACK_WEBHOOK_URL), more interactive (can include a "Retry" button via slash command). Overkill if it's a one-person project.
  • Render notifications — Render has a built-in failure-notification system at the service level, but it fires on service crashes, not on individual job failures inside a long-running worker. Not a fit.
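If SendGrid wins, the digest is a single POST to the v3 `/mail/send` endpoint with `SENDGRID_API_KEY` as a bearer token. A sketch with hypothetical helper names; the recipient/sender addresses and the `job_type` / `error` field names in `failures` are assumptions:

```python
import json
import os
import urllib.request

def build_digest_payload(failures, to_addr, from_addr):
    """Build a SendGrid v3 /mail/send JSON body summarizing failed jobs.
    `failures` is a list of dicts with 'job_type' and 'error' keys (assumed)."""
    lines = [f"{f['job_type']}: {f['error']}" for f in failures]
    return {
        "personalizations": [{"to": [{"email": to_addr}]}],
        "from": {"email": from_addr},
        "subject": f"[research_worker] {len(failures)} failed job(s)",
        "content": [{"type": "text/plain", "value": "\n".join(lines)}],
    }

def send_digest(payload):
    """POST the payload to SendGrid; the API returns 202 on acceptance."""
    req = urllib.request.Request(
        "https://api.sendgrid.com/v3/mail/send",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['SENDGRID_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req)
```

Using the raw REST endpoint avoids adding the `sendgrid` package as a dependency for one call; swapping in the official client later is trivial.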

Trigger points — where the alert fires from:

  1. Inside research_worker/loop.py when a job's status transitions to failed (final attempt) or dead (janitor-marked)
  2. Periodic check via the same scheduler that fires the catalog refresh — e.g., once per hour, scan research_jobs for status IN ('failed', 'dead') AND created_at > NOW() - INTERVAL '1 hour' and digest the results into one email
  3. A Postgres NOTIFY/LISTEN trigger that fires on row updates — overengineered for this scope

Recommendation: trigger #2 (periodic digest). It avoids alert storms during retry loops, batches multiple failures into one notification, and makes duplicate suppression straightforward.
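The digest query can be parameterized on a timestamp rather than hard-coding the one-hour window, which plays well with a high-water mark. A sketch run against sqlite so it is self-contained — production would be Postgres, and the `error` / `updated_at` column names are assumptions:

```python
import sqlite3

# Named-parameter style works in sqlite3; psycopg would use %(high_water_mark)s.
DIGEST_QUERY = """
SELECT id, job_type, status, error
FROM research_jobs
WHERE status IN ('failed', 'dead')
  AND updated_at > :high_water_mark
"""

def collect_failures(conn, high_water_mark):
    """Return failed/dead rows newer than the given high-water mark."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(DIGEST_QUERY, {"high_water_mark": high_water_mark})
    return [dict(r) for r in rows.fetchall()]
```

Passing the mark as a parameter means the same query serves both the fixed-window variant and the high-water-mark variant suggested under Tasks.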

Tasks

  • Decide channel (SendGrid email vs Slack)
  • Decide trigger (per-failure inline vs periodic digest)
  • Implement: add a new core/job_alerting.py module with a single entry point, send_failure_digest(). It should know:
    • which research_jobs rows transitioned to failed/dead since last digest
    • what the chain identity was (use payload.chain_started_at to group)
    • the most-recent error message per failure
  • Wire it: either as a periodic worker job (a recurring research_jobs row of its own — meta!), or as a Render cron alongside the scheduled refresh
  • Suppress duplicate notifications across digests using last_alerted_at on the row, or a simple "high-water-mark" timestamp in a config table
  • Test: kill a refresh job mid-run, confirm a notification fires within the digest window
  • Don't notify on pending/running/succeeded/canceled — only failed (after final retry) and dead
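The implement/wire/suppress tasks above can be sketched as one entry point with injected dependencies. Everything here is hypothetical shape, not existing code: `fetch_failures`, `send`, and the high-water-mark load/store are callables the real module would bind to the DB and the chosen channel, and `chain_started_at` is the grouping key named in the task list:

```python
from collections import defaultdict
from datetime import datetime, timezone

def group_by_chain(failures):
    """Group failed rows by chain identity (payload.chain_started_at) so one
    chain run yields one digest entry; rows without a chain stand alone."""
    chains = defaultdict(list)
    for row in failures:
        chains[row.get("chain_started_at") or row["id"]].append(row)
    return dict(chains)

def send_failure_digest(fetch_failures, send, load_hwm, store_hwm):
    """Fetch rows that turned failed/dead since the stored high-water mark,
    send one notification, then advance the mark so the next digest run is
    silent for the same rows. Returns the number of rows alerted on."""
    hwm = load_hwm()
    failures = fetch_failures(hwm)
    if not failures:
        return 0  # nothing new: digest stays silent
    send(group_by_chain(failures))
    store_hwm(datetime.now(timezone.utc).isoformat())
    return len(failures)
```

A single config-table timestamp (the "high-water-mark" option) keeps suppression out of the research_jobs schema, at the cost of re-alerting if the mark is ever reset.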

Acceptance

  • A failed or dead research_jobs row produces exactly one notification within the digest window
  • No notification for transient retries — only after attempts >= max_attempts
  • Notification includes: source (apple), job_type (refresh_catalog/rebuild_index), target_id, error message, link to the admin page
  • Duplicate notifications suppressed (a digest run that finds nothing new is silent)

Out of scope

  • "Stale catalog" alerting (separate signal: no successful refresh in N days). Easy to add later in the same module — flag for now.
  • Per-job-type custom routing (e.g. "Apple failures go to one channel, Spotify to another"). One channel for all is fine until proven otherwise.
  • A general-purpose monitoring system. We're solving the specific "refresh failed and nobody knows" problem; this isn't a Datadog substitute.
