
Failure alerting for research_worker jobs (esp. Apple Music catalog refresh) #179

@dprodger

Description


Background

Spinning out of #162 (Apple Music database service). Today, when the chained refresh job (apple/refresh_catalog × 3 → apple/rebuild_index) fails, the only signal is:

  • A failed or dead row in research_jobs
  • A line on the /admin/apple-music-catalog "Recent Refresh Activity" panel
  • An exception trace in the worker's logs

That means failures are silent until someone happens to load the admin page or grep through Render logs. With #178 landing, scheduled runs make this worse — a 3 AM failure could go unnoticed for a week.

This same gap exists for any worker job, not just Apple catalog refresh — but Apple catalog refresh is the most prominent case because the failure mode is "matchers run against stale data, silently degrade quality."

Goal

Notify a human when:

  1. A refresh-chain run fails (any of the four steps: download albums / songs / artists / rebuild_index)
  2. A run gets marked dead by the janitor (stuck past STUCK_RUNNING_AFTER)
  3. (Optional / later) A scheduled run is overdue — e.g., no successful refresh in the last 14 days when one was expected weekly
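The first two conditions reduce to a small predicate over a job row. A minimal sketch — the `JobRow` / `should_alert` names are hypothetical, and the `attempts` / `max_attempts` fields are assumed from the "only after attempts >= max_attempts" acceptance rule:

```python
from dataclasses import dataclass

@dataclass
class JobRow:
    status: str        # pending / running / succeeded / failed / dead / canceled
    attempts: int
    max_attempts: int

def should_alert(job: JobRow) -> bool:
    """Alert only on terminal failures: a final-attempt 'failed' or a
    janitor-marked 'dead'. Transient retries stay silent."""
    if job.status == "dead":
        return True
    if job.status == "failed" and job.attempts >= job.max_attempts:
        return True
    return False
```

The overdue-run check (condition 3) is deliberately left out, matching its "optional / later" status.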

Options

Channels — pick one or more:

  • Email via SendGrid — credentials already configured (SENDGRID_API_KEY), template-friendly, works with any email recipient. Best fit for a low-frequency signal like this.
  • Slack webhook — needs a new env var (SLACK_WEBHOOK_URL), more interactive (can include a "Retry" button via slash command). Overkill if it's a one-person project.
  • Render notifications — Render has a built-in failure-notification system at the service level, but it fires on service crashes, not on individual job failures inside a long-running worker. Not a fit.
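If SendGrid wins, the digest is a single POST to the v3 `/mail/send` endpoint with `SENDGRID_API_KEY` as a bearer token. A sketch with hypothetical helper names; the recipient/sender addresses and the `job_type` / `error` field names in `failures` are assumptions:

```python
import json
import os
import urllib.request

def build_digest_payload(failures, to_addr, from_addr):
    """Build a SendGrid v3 /mail/send JSON body summarizing failed jobs.
    `failures` is a list of dicts with 'job_type' and 'error' keys (assumed)."""
    lines = [f"{f['job_type']}: {f['error']}" for f in failures]
    return {
        "personalizations": [{"to": [{"email": to_addr}]}],
        "from": {"email": from_addr},
        "subject": f"[research_worker] {len(failures)} failed job(s)",
        "content": [{"type": "text/plain", "value": "\n".join(lines)}],
    }

def send_digest(payload):
    """POST the payload to SendGrid; the API returns 202 on acceptance."""
    req = urllib.request.Request(
        "https://api.sendgrid.com/v3/mail/send",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['SENDGRID_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req)
```

Using the raw REST endpoint avoids adding the `sendgrid` package as a dependency for one call; swapping in the official client later is trivial.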

Trigger points — where the alert fires from:

  1. Inside research_worker/loop.py when a job's status transitions to failed (final attempt) or dead (janitor-marked)
  2. Periodic check via the same scheduler that fires the catalog refresh — e.g., once per hour, scan research_jobs for status IN ('failed', 'dead') AND created_at > NOW() - INTERVAL '1 hour' and digest the results into one email
  3. A Postgres NOTIFY/LISTEN trigger that fires on row updates — overengineered for this scope

Recommendation: trigger #2 (periodic digest). It avoids alert storms during retry loops, batches multiple failures into one notification, and makes duplicate suppression straightforward.
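The digest query can be parameterized on a timestamp rather than hard-coding the one-hour window, which plays well with a high-water mark. A sketch run against sqlite so it is self-contained — production would be Postgres, and the `error` / `updated_at` column names are assumptions:

```python
import sqlite3

# Named-parameter style works in sqlite3; psycopg would use %(high_water_mark)s.
DIGEST_QUERY = """
SELECT id, job_type, status, error
FROM research_jobs
WHERE status IN ('failed', 'dead')
  AND updated_at > :high_water_mark
"""

def collect_failures(conn, high_water_mark):
    """Return failed/dead rows newer than the given high-water mark."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(DIGEST_QUERY, {"high_water_mark": high_water_mark})
    return [dict(r) for r in rows.fetchall()]
```

Passing the mark as a parameter means the same query serves both the fixed-window variant and the high-water-mark variant suggested under Tasks.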

Tasks

  • Decide channel (SendGrid email vs Slack)
  • Decide trigger (per-failure inline vs periodic digest)
  • Implement: add a new core/job_alerting.py module with a single entry point, send_failure_digest(). It should know:
    • which research_jobs rows transitioned to failed/dead since last digest
    • what the chain identity was (use payload.chain_started_at to group)
    • the most-recent error message per failure
  • Wire it: either as a periodic worker job (a recurring research_jobs row of its own — meta!), or as a Render cron alongside the scheduled refresh
  • Suppress duplicate notifications across digests using last_alerted_at on the row, or a simple "high-water-mark" timestamp in a config table
  • Test: kill a refresh job mid-run, confirm a notification fires within the digest window
  • Don't notify on pending/running/succeeded/canceled — only failed (after final retry) and dead
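The implement/wire/suppress tasks above can be sketched as one entry point with injected dependencies. Everything here is hypothetical shape, not existing code: `fetch_failures`, `send`, and the high-water-mark load/store are callables the real module would bind to the DB and the chosen channel, and `chain_started_at` is the grouping key named in the task list:

```python
from collections import defaultdict
from datetime import datetime, timezone

def group_by_chain(failures):
    """Group failed rows by chain identity (payload.chain_started_at) so one
    chain run yields one digest entry; rows without a chain stand alone."""
    chains = defaultdict(list)
    for row in failures:
        chains[row.get("chain_started_at") or row["id"]].append(row)
    return dict(chains)

def send_failure_digest(fetch_failures, send, load_hwm, store_hwm):
    """Fetch rows that turned failed/dead since the stored high-water mark,
    send one notification, then advance the mark so the next digest run is
    silent for the same rows. Returns the number of rows alerted on."""
    hwm = load_hwm()
    failures = fetch_failures(hwm)
    if not failures:
        return 0  # nothing new: digest stays silent
    send(group_by_chain(failures))
    store_hwm(datetime.now(timezone.utc).isoformat())
    return len(failures)
```

A single config-table timestamp (the "high-water-mark" option) keeps suppression out of the research_jobs schema, at the cost of re-alerting if the mark is ever reset.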

Acceptance

  • A failed or dead research_jobs row produces exactly one notification within the digest window
  • No notification for transient retries — only after attempts >= max_attempts
  • Notification includes: source (apple), job_type (refresh_catalog/rebuild_index), target_id, error message, link to the admin page
  • Duplicate notifications suppressed (a digest run that finds nothing new is silent)

Out of scope

  • "Stale catalog" alerting (separate signal: no successful refresh in N days). Easy to add later in the same module — flag for now.
  • Per-job-type custom routing (e.g. "Apple failures go to one channel, Spotify to another"). One channel for all is fine until proven otherwise.
  • A general-purpose monitoring system. We're solving the specific "refresh failed and nobody knows" problem; this isn't a Datadog substitute.
